David Lamparter [Sun, 7 Nov 2021 14:49:17 +0000 (15:49 +0100)]
tests: allow common_cli.c with logging enabled
common_cli.c disables logging by default so stdio is usable as vty
without log messages getting strewn in between. This is the right thing for
most tests, but not all; sometimes we do want log messages.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
David Lamparter [Sun, 7 Nov 2021 14:41:18 +0000 (15:41 +0100)]
lib: fix c-ares thread misuse
The `struct thread **ref` that the thread code takes is written to and
needs to stay valid over the lifetime of a thread. This does not hold
up if thread pointers are directly put in a `vector` since adding items
to a `vector` may reallocate the entire array. The thread code would
then write to a now-invalid `ref`, potentially corrupting entirely
unrelated data.
This should be extremely rare to trigger in practice since we only use
one c-ares channel, which will likely only ever use one fd, so the
vector is never resized. That said, c-ares using only one fd is just
plain fragile luck.
Either way, fix this by creating a resolver_fd tracking struct, and
clean up the code while we're at it.
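A minimal sketch of the idea, not the actual FRR code: `struct fd_track`,
`struct fd_table` and the helper names below are hypothetical. The table holds
pointers to separately allocated tracking structs, so growing the table may
move the pointer array but never the structs, and the address of each `ref`
slot stays valid.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-fd tracking struct; the real
 * resolver_fd struct in FRR has different members. */
struct fd_track {
	int fd;
	void *ref; /* slot the scheduler would write back into */
};

/* Growable table of pointers to separately allocated fd_track structs. */
struct fd_table {
	struct fd_track **items;
	size_t count;
	size_t alloced;
};

/* Append a new tracking struct; returns NULL on allocation failure. */
static struct fd_track *fd_table_add(struct fd_table *t, int fd)
{
	if (t->count == t->alloced) {
		size_t n = t->alloced ? t->alloced * 2 : 1;
		struct fd_track **p = realloc(t->items, n * sizeof(*p));

		if (!p)
			return NULL;
		/* Only the pointer array may move here; the structs do not. */
		t->items = p;
		t->alloced = n;
	}

	struct fd_track *ft = calloc(1, sizeof(*ft));

	if (!ft)
		return NULL;
	ft->fd = fd;
	t->items[t->count++] = ft;
	return ft;
}

/* Release every tracking struct and the pointer array itself. */
static void fd_table_free(struct fd_table *t)
{
	for (size_t i = 0; i < t->count; i++)
		free(t->items[i]);
	free(t->items);
	t->items = NULL;
	t->count = t->alloced = 0;
}
```

Had the table stored the structs (or the thread pointers) inline, the same
realloc() could have moved them and left any previously handed-out `&ref`
dangling.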
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Donald Sharp [Sun, 7 Nov 2021 12:45:27 +0000 (07:45 -0500)]
tests: Remove debugs from topotests
Debugs take up a significant amount of cpu time as well as
increased disk space for storage of results. Reduce test
overhead by removing the debugs. Hopefully this helps
alleviate some of the overloading that we are seeing in
our CI systems.
Donald Sharp [Fri, 5 Nov 2021 21:56:42 +0000 (17:56 -0400)]
ospf6d: Prevent crash in adj_ok
The adj_ok thread event is being added but not killed
when the underlying interface is deleted. I am seeing
this crash:
OSPF6: Received signal 11 at 1636142186 (si_addr 0x0, PC 0x561d7fc42285); aborting...
OSPF6: zlog_signal+0x18c 7f227e93519a7ffdae024590 /lib/libfrr.so.0 (mapped at 0x7f227e884000)
OSPF6: core_handler+0xe3 7f227e97305e7ffdae0246b0 /lib/libfrr.so.0 (mapped at 0x7f227e884000)
OSPF6: funlockfile+0x50 7f227e8631407ffdae024800 /lib/x86_64-linux-gnu/libpthread.so.0 (mapped at 0x7f227e84f000)
OSPF6: ---- signal ----
OSPF6: need_adjacency+0x10 561d7fc422857ffdae024db0 /usr/lib/frr/ospf6d (mapped at 0x561d7fbc6000)
OSPF6: adj_ok+0x180 561d7fc42f0b7ffdae024dc0 /usr/lib/frr/ospf6d (mapped at 0x561d7fbc6000)
OSPF6: thread_call+0xc2 7f227e989e327ffdae024e00 /lib/libfrr.so.0 (mapped at 0x7f227e884000)
OSPF6: frr_run+0x217 7f227e92a7f37ffdae024ec0 /lib/libfrr.so.0 (mapped at 0x7f227e884000)
OSPF6: main+0xf3 561d7fc0f5737ffdae024fd0 /usr/lib/frr/ospf6d (mapped at 0x561d7fbc6000)
OSPF6: __libc_start_main+0xea 7f227e6b0d0a7ffdae025010 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7f227e68a000)
OSPF6: _start+0x2a 561d7fc0f06a7ffdae0250e0 /usr/lib/frr/ospf6d (mapped at 0x561d7fbc6000)
OSPF6: in thread adj_ok scheduled from ospf6d/ospf6_interface.c:678 dr_election()
The crash happens because the on->ospf6_if pointer is NULL. The only way this could
happen from what I can tell is that the event is added to the system
and then we immediately delete the interface, freeing the memory
but not cancelling the adj_ok thread event.
Donald Sharp [Thu, 4 Nov 2021 17:00:51 +0000 (13:00 -0400)]
ospf6d: Prevent use after free
I am seeing a crash of ospf6d with this stack trace:
OSPF6: Received signal 11 at 1636042827 (si_addr 0x0, PC 0x55efc2d09ec2); aborting...
OSPF6: zlog_signal+0x18c 7fe20c8ca19a7ffd08035590 /lib/libfrr.so.0 (mapped at 0x7fe20c819000)
OSPF6: core_handler+0xe3 7fe20c90805e7ffd080356b0 /lib/libfrr.so.0 (mapped at 0x7fe20c819000)
OSPF6: funlockfile+0x50 7fe20c7f81407ffd08035800 /lib/x86_64-linux-gnu/libpthread.so.0 (mapped at 0x7fe20c7e4000)
OSPF6: ---- signal ----
OSPF6: ospf6_neighbor_state_change+0xdc 55efc2d09ec27ffd08035d90 /usr/lib/frr/ospf6d (mapped at 0x55efc2c8e000)
OSPF6: exchange_done+0x15c 55efc2d0ab4a7ffd08035dc0 /usr/lib/frr/ospf6d (mapped at 0x55efc2c8e000)
OSPF6: thread_call+0xc2 7fe20c91ee327ffd08035df0 /lib/libfrr.so.0 (mapped at 0x7fe20c819000)
OSPF6: frr_run+0x217 7fe20c8bf7f37ffd08035eb0 /lib/libfrr.so.0 (mapped at 0x7fe20c819000)
OSPF6: main+0xf3 55efc2cd75737ffd08035fc0 /usr/lib/frr/ospf6d (mapped at 0x55efc2c8e000)
OSPF6: __libc_start_main+0xea 7fe20c645d0a7ffd08036000 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fe20c61f000)
OSPF6: _start+0x2a 55efc2cd706a7ffd080360d0 /usr/lib/frr/ospf6d (mapped at 0x55efc2c8e000)
OSPF6: in thread exchange_done scheduled from ospf6d/ospf6_message.c:2264 ospf6_dbdesc_send_newone()
The stack trace when decoded is:
(gdb) l *(ospf6_neighbor_state_change+0xdc)
0x7bec2 is in ospf6_neighbor_state_change (ospf6d/ospf6_neighbor.c:200).
warning: Source file is more recent than executable.
195 on->name, ospf6_neighbor_state_str[prev_state],
196 ospf6_neighbor_state_str[next_state],
197 ospf6_neighbor_event_string(event));
198 }
199
200 /* Optionally notify about adjacency changes */
201 if (CHECK_FLAG(on->ospf6_if->area->ospf6->config_flags,
202 OSPF6_LOG_ADJACENCY_CHANGES)
203 && (CHECK_FLAG(on->ospf6_if->area->ospf6->config_flags,
204 OSPF6_LOG_ADJACENCY_DETAIL)
OSPFv3 is creating the event without tracking the thread, and as such
if the event is not run before a deletion event comes in, memory
will be freed and we'll start trying to access memory we should
not. Modify ospfv3 to track the thread and appropriately stop
it when the memory is deleted or when it no longer needs to run
that bit of code.
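A minimal sketch of the tracking pattern, using a toy event queue rather than
the FRR thread library: `event_add`, `event_cancel`, `event_run_all` and the
`t_exchange_done` member are hypothetical names. Scheduling records the event
in a slot owned by the neighbor, and deleting the neighbor cancels through
that slot, so the callback can never run against freed memory.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 8

struct event {
	void (*func)(void *arg);
	void *arg;
	struct event **ref; /* owner's slot, cleared on run or cancel */
	int active;
};

static struct event queue[MAX_EVENTS];

/* Schedule func(arg) and record the event in *ref so it can be cancelled. */
static int event_add(void (*func)(void *), void *arg, struct event **ref)
{
	for (size_t i = 0; i < MAX_EVENTS; i++) {
		if (queue[i].active)
			continue;
		queue[i] = (struct event){ func, arg, ref, 1 };
		*ref = &queue[i];
		return 0;
	}
	return -1; /* queue full */
}

/* Cancel a pending event through the owner's slot; no-op if none pending. */
static void event_cancel(struct event **ref)
{
	if (*ref) {
		(*ref)->active = 0;
		*ref = NULL;
	}
}

/* Run all pending events; returns how many callbacks ran. */
static int event_run_all(void)
{
	int ran = 0;

	for (size_t i = 0; i < MAX_EVENTS; i++) {
		if (!queue[i].active)
			continue;
		queue[i].active = 0;
		*queue[i].ref = NULL; /* the owner must still exist here */
		queue[i].func(queue[i].arg);
		ran++;
	}
	return ran;
}

/* Hypothetical neighbor with one tracked event. */
struct neighbor {
	int state;
	struct event *t_exchange_done;
};

static void exchange_done(void *arg)
{
	((struct neighbor *)arg)->state++;
}

/* Stop the pending event before the neighbor's memory goes away. */
static void neighbor_delete(struct neighbor *on)
{
	event_cancel(&on->t_exchange_done);
	/* the real code would free the neighbor here */
}
```

Without the cancel in neighbor_delete(), event_run_all() would write through
`ref` and call exchange_done() on memory that no longer belongs to the
neighbor.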
Donald Sharp [Fri, 5 Nov 2021 15:49:37 +0000 (11:49 -0400)]
tests: pim_basic needs to wait for event to happen under load
Under load, the test looks for upstream state only once,
immediately after starting 2 streams of S,G data.
Give the system some time to process this
and check that the state actually shows up within a small
amount of time.
Donald Sharp [Fri, 5 Nov 2021 15:13:12 +0000 (11:13 -0400)]
tests: Ensure ospf has reconverged before continuing
The test_ldp_pseudowires_after_link_down test
shuts a link down and was blindly waiting 5 seconds
before just assuming the test system was in a sane
state. Remove the sleep(5) and actually look for
the changed state for the route 2.2.2.2 that the
pseudowire actually depends on.
Donald Sharp [Thu, 4 Nov 2021 15:45:27 +0000 (11:45 -0400)]
tests: Fix route replace test in all_protocol_startup
The route replace test was doing this seq of events:
a) Create nhg
b) Install route w/ sharpd
c) Ensure it worked
d) Modify nhg
e) Ensure the update group replace worked
The problem is that the sharp code is doing this:
/* Only send via ID if nhgroup has been successfully installed */
if (nhgid && sharp_nhgroup_id_is_installed(nhgid)) {
SET_FLAG(api.message, ZAPI_MESSAGE_NHG);
api.nhgid = nhgid;
} else {
for (ALL_NEXTHOPS_PTR(nhg, nh)) {
api_nh = &api.nexthops[i];
zapi_nexthop_from_nexthop(api_nh, nh);
i++;
}
api.nexthop_num = i;
}
The created nhg has not been successfully installed (or at least
sharpd has not read the results yet) when it gets the command
to install the routes. As such it passes down the individual
nexthops instead. The route replace is never going to work.
Modify the code to add a bit of sleep to allow sharpd to
get notified when the system is under load. At this point
there is no way to query sharpd for whether or not it
thinks its nhg is installed properly. This
test is failing all over the place for a bunch of people;
let's get this fixed so people can get running.
Donald Sharp [Thu, 4 Nov 2021 12:01:14 +0000 (08:01 -0400)]
zebra: Send up ifindex for redistribution when appropriate
Currently the NEXTHOP_TYPE_IPV4 and NEXTHOP_TYPE_IPV6 nexthops are
not sending up the resolved ifindex for the route. This
breaks upper level protocols that have a configuration like
this:
route-map FOO permit 10
match interface swp13
!
router ospf
redistribute static
!
ip route 4.5.6.7/32 10.10.10.10
where 10.10.10.10 resolves to interface swp13. The route-map
will never match in this case.
Since FRR has the resolved nexthop interface, FRR might as
well send it up to be selected on by the upper level protocol
as needed.
Rafael Zalamena [Tue, 2 Nov 2021 21:54:23 +0000 (18:54 -0300)]
bgpd: fix BFD configuration update on TTL change
When altering the TTL of an eBGP peer, also update the BFD
configuration. This was only working when the configuration happened
after the peer connection had been established.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Philippe Guibert [Thu, 28 Oct 2021 16:28:42 +0000 (18:28 +0200)]
zebra: update dataplane flowspec address family in ipset_info
The ipset entry needs to know which address family it applies to.
The family is in the original ipset structure but was not passed
as an attribute in the dataplane ipset_info structure. Add it.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Philippe Guibert [Thu, 28 Oct 2021 11:42:57 +0000 (13:42 +0200)]
zebra: fix flowspec ipset operations
When injecting an ipset entry into the zebra dataplane context, the
ipset name is stored in a separate structure. This lets the
flowspec plugin know which ipset the relevant ipset entry has to be
appended to.
The problem was that the zebra dataplane objects related to ipset entries
were made up of a union between the ipset structure and the ipset info
structure. This meant that the two structures shared the same
memory, and when extracting the stored data, the data were incomplete.
Fix this by replacing the union with a plain struct.
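A minimal sketch of why the union loses data, with hypothetical structure and
member names (the real zebra dataplane structures differ): in a union both
members start at the same address, so storing into one overwrites the other,
while a struct gives each member its own storage.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified versions of the two structures. */
struct ipset {
	char name[16];
	int family;
};

struct ipset_entry_info {
	unsigned int unique;
	int proto;
};

/* Overlapping storage: writing one member clobbers the other. */
union ctx_union {
	struct ipset set;
	struct ipset_entry_info info;
};

/* Separate storage: both members hold valid data at the same time. */
struct ctx_struct {
	struct ipset set;
	struct ipset_entry_info info;
};

static void fill_union(union ctx_union *c)
{
	memset(c, 0, sizeof(*c));
	strcpy(c->set.name, "match0x1");
	c->info.unique = 0xffffffffu; /* lands on the first bytes of set.name */
}

static void fill_struct(struct ctx_struct *c)
{
	memset(c, 0, sizeof(*c));
	strcpy(c->set.name, "match0x1");
	c->info.unique = 0xffffffffu; /* set.name is left untouched */
}
```

After fill_union() the ipset name no longer starts with 'm', which is the
kind of incomplete data described above; after fill_struct() both the name
and the entry info read back intact.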
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Philippe Guibert [Wed, 27 Oct 2021 14:45:05 +0000 (16:45 +0200)]
ospf6d: avoid writing dumb ospf6 info at startup
In the 'show ipv6 ospf6' command handler, the reason for SPF
execution is looked up and displayed. At startup, SPF has been
started, but has no specific reason. Instead of dumping an
uninitialised string context, reset the string context.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Igor Ryzhov [Tue, 2 Nov 2021 21:29:19 +0000 (00:29 +0300)]
lib: fix crash when terminating inactive VRFs
If the VRF is not enabled, if_terminate deletes the VRF after the last
interface is removed from it. Therefore daemons crash on the subsequent
call to vrf_delete. We should call vrf_delete only for enabled VRFs.
Igor Ryzhov [Tue, 2 Nov 2021 20:54:43 +0000 (23:54 +0300)]
zebra: fix stale pointer when netns is deleted
When the netns is deleted, we should always clear the vrf->ns_ctxt
pointer. Currently, it is not cleared when there are interfaces in the
netns at the time of deletion.
If the netns is re-created, zebra crashes because it tries to use the
stale pointer.
Donald Sharp [Mon, 1 Nov 2021 19:08:50 +0000 (15:08 -0400)]
tests: All_protocol_startup sporadic failure
The test_nexthop_groups function is failing occasionally
because the test executes 4 sharp install routes commands
in succession. When I dumped the rib on a failed test
run there were only 2 of the 4 routes in the rib and
the two that were in were the last 2 installed.
The sharp daemon sets up an event process where it
installs routes `automatically`. If the previous
run is not finished, entering a new command to install
routes will prevent the previous one from ever completing.
It is assumed that the user doesn't do stupid stuff here.
In this case I am just adding a small sleep between each
installation to just let the test proceed.
Donald Sharp [Fri, 29 Oct 2021 17:03:42 +0000 (13:03 -0400)]
lib: Return Null when we have an empty string for script name
The script entries were being stored in a hash lookup with
the script name a pre-defined array of characters. The hash
lookup is succeeding since the entry is auto-installed at script
start time regardless of whether there is a handler function.
Modify the code so that if the scriptname is an empty
string "\0" just return a NULL so that zebra does
not attempt to actually load up the script.
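A minimal sketch of the lookup behaviour, assuming a small linear table in
place of the real hash; `struct script_entry`, `script_name_get` and the hook
names are hypothetical. An entry that exists but carries an empty script name
is reported the same way as a missing entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: hook function name -> script file name. */
struct script_entry {
	char function_name[32];
	char script_name[32]; /* "" means no script is configured */
};

static struct script_entry entries[] = {
	{ "hook_with_script", "my_script" },
	{ "hook_without_script", "" },
};

/* Return the configured script name, or NULL if the hook is unknown
 * or its script name is the empty string. */
static const char *script_name_get(const char *function_name)
{
	for (size_t i = 0; i < sizeof(entries) / sizeof(entries[0]); i++) {
		if (strcmp(entries[i].function_name, function_name) != 0)
			continue;
		if (entries[i].script_name[0] == '\0')
			return NULL;
		return entries[i].script_name;
	}
	return NULL;
}
```

A caller can then skip loading entirely whenever the result is NULL, without
having to tell "entry missing" apart from "entry present but empty".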
Donald Sharp [Mon, 1 Nov 2021 00:08:29 +0000 (20:08 -0400)]
tests: isis_topo1 needs to wait for results under load
The isis_topo1 test has two functions that, immediately
after ensuring that the routes are in isis,
test to see if they are in the rib. Under system
load I am seeing this test failing because the
routes are still queued. Modify the zebra check
for the isis routes to look for the proper results
for 10 seconds.
Tested with GoBGP (Helper):
```
long-lived-graceful-restart: advertised and received
Local:
ipv4-unicast, restart time 100000 sec
Remote:
ipv4-unicast, restart time 10 sec, forward flag set
```
Donald Sharp [Fri, 29 Oct 2021 14:21:28 +0000 (10:21 -0400)]
tests: Fix zebra_seg6_route to not always reinstall the same route
This code has two issues:
a) The loop to test for successful installation re-installs
the route every time it loops. A system under load will
have issues ensuring the route is installed and repeated
attempts do not help.
b) The nexthop group installation was always failing
but never noticed (because of the previous commit)
and the test was always passing, when it should
have never passed.
Donald Sharp [Fri, 29 Oct 2021 12:47:05 +0000 (08:47 -0400)]
tests: zebra_seg6local has a race condition
The test is checking installing of seg6 routes by this
loop:
for up to 5 times:
sharp install seg6 route
show ip route and is it installed
The problem is that if the system is under heavy
load the installation may not have happened yet
and by immediately reinstalling the same route
the same thing could happen again.
Modify the code to pull the route installation
outside of the loop and to increase to 10 attempts
in case there is very heavy system load.
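The actual test is a Python topotest; the C sketch below only illustrates the
control flow. `struct fake_rib` and all function names are hypothetical, and
the fake models a reinstall as restarting the processing delay, which is an
assumption about how the system behaves under load.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the system under test. */
struct fake_rib {
	int installs;      /* how many times the route was (re)installed */
	int pending_polls; /* checks that still fail before the route shows up */
	int delay;         /* processing delay applied by every install */
};

/* Every install restarts processing (assumed behaviour under load). */
static void install_route(struct fake_rib *rib)
{
	rib->installs++;
	rib->pending_polls = rib->delay;
}

/* One "show ip route" style check: true once the route is visible. */
static bool route_installed(struct fake_rib *rib)
{
	if (rib->pending_polls > 0) {
		rib->pending_polls--;
		return false;
	}
	return true;
}

/* Fixed flow: install once, then only poll, up to `attempts` checks. */
static bool install_and_wait(struct fake_rib *rib, int attempts)
{
	install_route(rib);
	for (int i = 0; i < attempts; i++) {
		if (route_installed(rib))
			return true;
		/* a real test would sleep briefly here before retrying */
	}
	return false;
}

/* Racy flow: reinstalling on every pass restarts the delay each time. */
static bool reinstall_each_attempt(struct fake_rib *rib, int attempts)
{
	for (int i = 0; i < attempts; i++) {
		install_route(rib);
		if (route_installed(rib))
			return true;
	}
	return false;
}
```

With a delay of 3 checks, install_and_wait() succeeds within 10 attempts
after a single install, while reinstall_each_attempt() never sees the route
no matter how many attempts it is given.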
Olivier Dugeon [Mon, 25 Oct 2021 09:52:19 +0000 (11:52 +0200)]
lib: Fix comparison function in link_state.c
ls_node_same, ls_attributes_same and ls_prefix_same are not producing the expected
result due to incorrect usage of memcmp. In addition, if the respective structures
are not initialized with 0, there is a risk that the comparison fails.
This patch corrects the usage of memcmp and expands the comparison to each individual
parameter of the respective structure for a safer result.
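A minimal sketch of the uninitialized-structure risk, using a hypothetical
`struct node_id` rather than the real link-state structures. On most ABIs the
compiler inserts padding after `origin`, and memcmp() compares those padding
bytes too, so two structures with equal fields can still compare as
different. Comparing field by field avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure; padding after `origin` is typical but
 * ABI dependent. */
struct node_id {
	uint8_t origin;
	uint32_t id;
};

/* Field-by-field comparison: padding bytes never take part. */
static bool node_id_same(const struct node_id *a, const struct node_id *b)
{
	return a->origin == b->origin && a->id == b->id;
}

/* Fill the whole object, padding included, with `junk`, then set the
 * fields. This imitates a structure that was not zeroed before use. */
static void node_id_init(struct node_id *n, uint8_t junk, uint8_t origin,
			 uint32_t id)
{
	memset(n, junk, sizeof(*n));
	n->origin = origin;
	n->id = id;
}
```

Note also that memcmp() returns 0 when the buffers are equal, so a "same"
helper built on it has to test for `== 0` rather than use the result as a
boolean directly.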
Donald Sharp [Thu, 28 Oct 2021 19:51:46 +0000 (15:51 -0400)]
tests: Fix `check_ping` function in test_bgp_srv6l3vpn_to_bgp_vrf.py
The `_check` function in check_ping was asserting while being
passed to topotests.run_and_expect(), causing
it to not run the full range of pings if one failed.
So effectively it was properly detecting pass / failure but
only allowing for 1 iteration if it was going to fail.
Modify the code to not assert and act like all the other
run_and_expect functionality.
Igor Ryzhov [Thu, 14 Oct 2021 16:15:14 +0000 (19:15 +0300)]
lib: make if_lookup_by_index_all_vrf internal
This function doesn't work correctly with netns VRF backend as the same
index may be used in multiple netns simultaneously. So let's hide it
from the public API to reduce temptation to use it instead of writing
the correct code.
Donald Sharp [Thu, 28 Oct 2021 12:10:28 +0000 (08:10 -0400)]
zebra: Fix netlink RTM_NEWNEXTHOP parsing for nested attributes
With the addition of resilient hashing for nexthops, the
parsing of nexthops requires telling the decoder functions
that there may be nested attributes. This was found by
code inspection of iproute2/ipnexthop.c when trying to
understand resilient hashing as well as statistics
gathering for nexthops that are / will be in upstream
kernels in the near future.
The default vrf name is obtained by zebra daemon. While isis is not
connected to zebra, i.e. at startup, when loading a startup configuration,
the macro VRF_DEFAULT_NAME is used and returns 'default'.
But once zebra connects and forces a new default vrf name, the
configuration is no longer seen as the default one, and further attempts to
configure the isis instance via 'router isis 1' will trigger creation
of another instance.
To handle this situation, at the vrf_enable() event, which is called for
each default vrf name change, the associated isis instance is updated
with the new vrf name. The same is done for the NB yang path.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>