Renato Westphal [Thu, 18 Nov 2021 17:52:20 +0000 (14:52 -0300)]
ospfd: fix incorrect detection of topology changes in helper mode
This commit fixes a rather obscure bug that was causing the GR
topotest to fail on a frequent basis.
RFC 3623 specifies that a router acting as a helper to a restarting
neighbor should monitor topology changes and abort the GR procedures
when one is detected, falling back to normal OSPF operation.
ospfd uses the ospf_lsa_different() function to detect when the
content of an LSA has changed, which is considered as a topology
change. The problem is that ospf_lsa_different() can return true
even when the two LSAs passed as parameters are identical, provided
one LSA has the OSPF_LSA_RECEIVED flag set and the other not.
In the context of the ospf_gr_topo1 test, router rt6 performs
a graceful restart and a few seconds later acts as a helper for
router rt7. While acting as a helper for rt7, it has not yet
translated its NSSA Type-7 LSAs, which happens only 7 seconds
(OSPF_ABR_TASK_DELAY) after the first SPF run. The translated
Type-5 LSAs on its LSDB were learned from the helping neighbors
(rt3 and rt7). It's then possible that the NSSA Type-7 LSAs might
be translated while rt6 is acting as helper for rt7, which causes
the daemon to detect a non-existent topology change only because
the OSPF_LSA_RECEIVED flag is unset in the recently originated
Type-5 LSA.
Fix this problem by ignoring the OSPF_LSA_RECEIVED flag when
comparing LSAs for the purpose of topology change detection.
In short, the bug only shows up when the restarting router
starts acting as a helper immediately after coming back up
(which is unlikely to happen in the real world). The topotest
failures became more frequent after commit 6255aad0bc78c1 because of
the removal of the 'sleep' calls, which used to give ospfd more time
to converge before it started acting as a helper for other routers. The
problem still occurred from time to time though.
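The comparison described above can be sketched as follows. This is a minimal illustration, not the ospfd code: struct demo_lsa, the DEMO_LSA_* flag values, demo_lsa_different() and its ignore_rcvd_flag parameter are all hypothetical stand-ins chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the real ospfd definitions. */
#define DEMO_LSA_SELF     0x01
#define DEMO_LSA_RECEIVED 0x04

struct demo_lsa {
	uint8_t flags;    /* local bookkeeping, not part of the LSA content */
	uint16_t length;  /* number of valid bytes in body */
	uint8_t body[16];
};

/*
 * Return true when the two LSAs differ. When ignore_rcvd_flag is set,
 * the RECEIVED bit is masked out of both flag fields first, so a
 * received copy and a self-originated copy with identical content
 * compare as equal.
 */
static bool demo_lsa_different(const struct demo_lsa *l1,
			       const struct demo_lsa *l2,
			       bool ignore_rcvd_flag)
{
	uint8_t f1, f2;

	assert(l1 && l2);
	assert(l1->length <= sizeof(l1->body));

	f1 = l1->flags;
	f2 = l2->flags;
	if (ignore_rcvd_flag) {
		f1 &= (uint8_t)~DEMO_LSA_RECEIVED;
		f2 &= (uint8_t)~DEMO_LSA_RECEIVED;
	}

	if (f1 != f2)
		return true;
	if (l1->length != l2->length)
		return true;
	return memcmp(l1->body, l2->body, l1->length) != 0;
}
```

In this sketch a helper-mode caller would pass true for ignore_rcvd_flag, while other callers keep the stricter comparison.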
Donald Sharp [Sat, 20 Nov 2021 00:18:30 +0000 (19:18 -0500)]
tests: Fix tests using exabgp to explicitly call out which python to use
There exist systems that do not explicitly have a python soft-link
on their system. Let's explicitly call out which python we want
to be using with exabgp.
pimd : packet processing optimization on rp change
Problem Statement:
==================
On rp_change, PIM processes all the upstreams in a loop, and for selected
upstreams PIM has to send join/prune based on the changed RPF.
Join and prune packets are not getting aggregated in a single packet.
Root Cause Analysis:
====================
On pim_rp_change, pim_upstream_update() gets called for selected upstreams.
This API calculates to whom it has to send join and to whom it has to
send prune via API pim_zebra_upstream_rpf_changed(). This API prepares
the upstream_switch_list list per interface and inserts the group and
sources.
Now PIM is still in the pim_upstream_update() API context, i.e. PIM
is still processing the same upstream. At the end there is a
call to pim_zebra_update_all_interfaces() which processes the
upstream_switch_list list, sends the packets out and clears the list.
As a result, the list is flushed once per upstream and nothing is aggregated.
Fix:
====
Don't process the upstream_switch_list in the upstream context.
Process all the upstreams, prepare the upstream_switch_list, and then
process it in one go. This clubs all the S,G entries together.
It also avoids repeated list cleanup, i.e. allocating and
deallocating memory multiple times.
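The difference between flushing per upstream and flushing once can be sketched with a toy model. Everything here is hypothetical (struct demo_switch_list and the demo_* functions are not pimd code), and the model is deliberately simplified: one list, one packet per flush, no per-interface or MTU limits.

```c
#include <assert.h>

/* Hypothetical model of a pending join/prune list. */
struct demo_switch_list {
	int pending;      /* queued (S,G) entries not yet sent */
	int packets_sent; /* packets emitted so far */
};

/* Queue one (S,G) entry for a later join/prune packet. */
static void demo_queue_entry(struct demo_switch_list *list)
{
	assert(list);
	list->pending++;
}

/* Send everything queued so far in one packet and clear the list. */
static void demo_flush(struct demo_switch_list *list)
{
	assert(list);
	if (list->pending > 0) {
		list->packets_sent++;
		list->pending = 0;
	}
}

/* Old behaviour: flush inside the per-upstream context. */
static int demo_rp_change_per_upstream(int upstreams)
{
	struct demo_switch_list list = {0, 0};

	for (int i = 0; i < upstreams; i++) {
		demo_queue_entry(&list);
		demo_flush(&list);
	}
	return list.packets_sent;
}

/* New behaviour: queue for every upstream, then flush once. */
static int demo_rp_change_batched(int upstreams)
{
	struct demo_switch_list list = {0, 0};

	for (int i = 0; i < upstreams; i++)
		demo_queue_entry(&list);
	demo_flush(&list);
	return list.packets_sent;
}
```

With five upstreams the per-upstream variant emits five packets in this model, while the batched variant emits one.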
Mark Stapp [Thu, 18 Nov 2021 12:21:34 +0000 (07:21 -0500)]
zebra: during shutdown, don't process LSPs on the lsp workqueue
During zebra shutdown, we clear out the LSP workqueue. The LSPs
will be uninstalled and freed during the shutdown process, so
just ignore any LSPs that happen to be on the workqueue.
Upon code inspection there was no place where we disabled the t_write thread upon ospf6 deletion.
If the code were to issue a `no router ospf6` and then recreate it, we could see this crash.
David Lamparter [Sun, 24 Oct 2021 11:46:06 +0000 (13:46 +0200)]
pimd: fix event order for forward_stop()
`pim_ifchannel_ifjoin_switch()` changes flags that `pim_forward_stop()`
looks at. This leads to data flow continuing until we have some reason
to sync state again.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Donald Sharp [Wed, 17 Nov 2021 13:51:14 +0000 (08:51 -0500)]
tests: Re-add the ability to generate core files with topotests
Somewhere along the line core-files stopped being generated
with the running of the topotests. With this change we now
see this:
sharpd@eva /t/topotests> find . -name '*.dmp' -print
./ospfv3_basic_functionality.test_ospfv3_asbr_summary_topo1/r0/ospf6d_core-sig_6-pid_430478.dmp
sharpd@eva /t/topotests> sudo gdb /usr/lib/frr/ospf6d ./ospfv3_basic_functionality.test_ospfv3_asbr_summary_topo1/r0/ospf6d_core-sig_6-pid_430478.dmp
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/frr/ospf6d...
[New LWP 430478]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/ospf6d --log file:ospf6d.log --log-level debug -d'.
Program terminated with signal SIGABRT, Aborted.
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
(gdb)
Donald Sharp [Thu, 30 Sep 2021 17:56:15 +0000 (13:56 -0400)]
zebra: Expand v4/v6 route space
At some scale we eventually run out of room displaying v4/v6 route
totals for `show zebra client summ`:
janelle# show zebra client summ
Name Connect Time Last Read Last Write IPv4 Routes IPv6 Routes
--------------------------------------------------------------------------------
bgp 04w0d18h 00:00:19 00:01:2411729127/40526812037786/903094
This total overran the space in just a little over a week of uptime.
Expand to have a bit more room.
Donald Sharp [Thu, 28 Oct 2021 14:42:23 +0000 (10:42 -0400)]
zebra: return void for dplane_ctx_get_pbr_ipset_entry
The dplane_ctx_get_pbr_ipset_entry function only
failed when the caller did not pass in a valid
usable pointer. Change the code to assert on
a pointer not being passed in and remove the
bool return
Donald Sharp [Thu, 28 Oct 2021 14:35:51 +0000 (10:35 -0400)]
zebra: return void for dplane_ctx_get_pbr_iptable
The only time this function ever failed is when
the developer does not pass in a usable pointer
to place the data in. Change it to an assert
to signify to the end developer that is what
we want and then remove all the if checks
for failure
Donald Sharp [Thu, 28 Oct 2021 13:15:44 +0000 (09:15 -0400)]
zebra: dplane_ctx_get_pbr_ipset should return void
The function call dplane_ctx_get_pbr_ipset only
returns false when the calling function fails to
pass in a valid ipset pointer. This should
be an assertion issue since it's a programming
issue as opposed to an actual run time issue.
Change the function call parameter to not return
a bool on success/fail for a compile time decision.
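The pattern shared by the three dplane_ctx_get_pbr_* changes above can be sketched as below. The types and the function name are hypothetical stand-ins (struct demo_ctx, struct demo_ipset, demo_ctx_get_ipset()), not the zebra dataplane API; the point is only that a NULL output pointer is treated as a programming error rather than a run-time failure.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for a dataplane context and its payload. */
struct demo_ipset {
	char name[32];
	int type;
};

struct demo_ctx {
	struct demo_ipset ipset;
};

/*
 * Previously this kind of getter returned bool and callers had to
 * check for failure, although the only failure was a NULL 'out'.
 * Returning void and asserting makes the contract explicit and
 * removes the dead error handling at every call site.
 */
static void demo_ctx_get_ipset(const struct demo_ctx *ctx,
			       struct demo_ipset *out)
{
	assert(ctx);
	assert(out);

	memcpy(out, &ctx->ipset, sizeof(*out));
}
```

Callers then simply invoke the getter and use the result, with no `if` check around it.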
David Lamparter [Thu, 22 Jul 2021 09:49:08 +0000 (11:49 +0200)]
pimd: correctly process rp-count==0 BSMs
rp-count==0 isn't a broken BSM, it just means the BSR no longer has any
Candidate RPs for the group range. Previous behavior is badly mistaken
since it stops processing the entire packet.
Fix to correctly remove group range on rp-count==0 and continue
processing remainder of the packet.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
David Lamparter [Fri, 25 Jun 2021 08:53:26 +0000 (10:53 +0200)]
pimd: clean up BSR NHT & fix parallel links
The Bootstrap message RX path needs an RPF check for the BSR address,
and the current implementation is both incorrect and quite ugly.
Clean up and fix the case where we have multiple interfaces to the same LAN
and/or ECMP nexthops (both would cause message duplication, the former
can even cause BSM forwarding loops.)
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
David Lamparter [Tue, 16 Nov 2021 12:29:44 +0000 (13:29 +0100)]
vtysh: dispatch unique-id backtrace cmd properly
i.e. to whoever cares, since some unique IDs (from libfrr) are valid
everywhere but some others (from the daemons) only apply to specific
daemons.
(Default handling aborts on first error, so configuring any unique IDs
that don't exist on the first daemon vtysh connects to just failed
before this.)
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Igor Ryzhov [Tue, 16 Nov 2021 15:01:03 +0000 (18:01 +0300)]
*: unify if_is_loopback/if_is_loopback_or_vrf
We should always treat the VRF interface as a loopback. Currently, this
is not the case, because in some old pre-VRF code we use if_is_loopback
instead of if_is_loopback_or_vrf. To avoid any future problems, the
proposal is to rename if_is_loopback_or_vrf to if_is_loopback and use it
everywhere. if_is_loopback is renamed to if_is_loopback_exact in case
it's ever needed, but currently it's not used anywhere.
Igor Ryzhov [Mon, 15 Nov 2021 16:45:18 +0000 (19:45 +0300)]
ospf6d: replace memcmp with correct comparisons
Using memcmp with complex structures like prefix or ospf6_ls_origin is
not correct, because even two structures with the same values in all fields
may have different values in padding bytes, and the comparison may then fail.
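A minimal sketch of the field-by-field alternative, assuming a made-up struct demo_prefix whose layout typically leaves padding between the one-byte members and the 32-bit address. The struct and demo_prefix_same() are illustrative only and do not reproduce FRR's own prefix helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical structure; on common ABIs two padding bytes sit
 * between 'prefixlen' and 'addr'. */
struct demo_prefix {
	uint8_t family;
	uint8_t prefixlen;
	uint32_t addr;
};

/*
 * Compare only the named fields. Unlike memcmp() over the whole
 * struct, the result does not depend on whatever bytes happen to be
 * in the padding.
 */
static bool demo_prefix_same(const struct demo_prefix *a,
			     const struct demo_prefix *b)
{
	assert(a && b);

	return a->family == b->family
	       && a->prefixlen == b->prefixlen
	       && a->addr == b->addr;
}
```

Two objects that were initialized differently (for example one zeroed and one filled with 0xFF before the fields were assigned) still compare equal with this function, which is the behaviour a byte-wise memcmp() cannot guarantee.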
Igor Ryzhov [Mon, 15 Nov 2021 21:00:00 +0000 (00:00 +0300)]
zebra: fix memleak on shutdown
During shutdown, when table_manager_disable is called for the default
VRF, its vrf_id is already set to VRF_UNKNOWN, so the expression is true
and the table manager memory is not freed. Change the expression to
compare the VRF name instead of the id. The check in table_manager_enable
is changed for consistency.
Olivier Dugeon [Mon, 15 Nov 2021 17:19:35 +0000 (18:19 +0100)]
ospfd: Fix wrong parsing of TE subTLV
Functions ospf_te_parse_te() and ospf_te_delete_te() browse the TE TLV but also
its subTLVs. The loop that parses the subTLVs checks that the cumulative read data doesn't
exceed the total size of the TLV. However, the sum variable that counts the
amount of read data was wrongly initialized to 0 instead of 4 (i.e. the initial
TLV header size that is located at the top of the subTLVs).
This patch adjusts the initial value of the counter accordingly.
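The counter initialization can be illustrated with a small stand-alone walker. This is a sketch under assumptions, not the ospfd parser: demo_count_subtlvs() is hypothetical, it assumes the total size passed in includes the 4-byte TLV header, and it assumes the usual TE encoding of a 2-byte type, a 2-byte length and a value padded to a multiple of 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TLV_HDR_SIZE 4U

/*
 * Count the sub-TLVs inside one TLV. 'buf' points at the TLV header
 * and 'total' is the size of the whole TLV including that header.
 * Returns -1 if a sub-TLV would run past the end of the TLV.
 */
static int demo_count_subtlvs(const uint8_t *buf, size_t total)
{
	/* The TLV header has already been consumed, so start at 4.
	 * Starting at 0 would let the loop read 4 bytes too far. */
	size_t sum = DEMO_TLV_HDR_SIZE;
	int count = 0;

	assert(buf);
	if (total < DEMO_TLV_HDR_SIZE)
		return -1;

	while (sum < total) {
		size_t len, size;

		if (total - sum < DEMO_TLV_HDR_SIZE)
			return -1;

		/* Length field: bytes 2-3 of the sub-TLV header. */
		len = ((size_t)buf[sum + 2] << 8) | buf[sum + 3];
		/* Header plus value rounded up to a 4-byte boundary. */
		size = DEMO_TLV_HDR_SIZE + ((len + 3) & ~(size_t)3);
		if (size > total - sum)
			return -1;

		sum += size;
		count++;
	}
	return count;
}
```

Because the counter and the total are measured from the same origin, the loop stops exactly at the end of the TLV instead of treating the following 4 bytes as another sub-TLV header.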
As part of the check, it memcompares two structs ospf6_path. This struct
has a pointer field nh_list which is allocated every time a new path is
created, which means it can never be the same for two different paths.
Therefore this check is always false and can be completely removed.