Donald Sharp [Fri, 28 Mar 2025 18:58:01 +0000 (14:58 -0400)]
bgpd: On shutdown, unlock table when clearing the bgp metaQ
There are some tables not being freed upon shutdown. This
is happening because the table is being locked as dests
are being put on the metaQ. When in shutdown it was clearing
the MetaQ it was not unlocking the table
Modified the fsm state machine to attempt to not
clear routes on a peer that was not established.
The peer should be not a peer self. We do not want
to ever clear the peer self.
Donald Sharp [Wed, 26 Mar 2025 15:14:57 +0000 (11:14 -0400)]
tests: Modify simple_snmp_test to use frr.conf
The simple_snmp_test was not properly testing
the rip snmp code because of weirdness w/ mgmtd
and non-integrated configs. Modify the whole
test to use a integrated config and voila
ripd is talking snmp again in the test.
Donatas Abraitis [Wed, 26 Mar 2025 12:47:44 +0000 (14:47 +0200)]
tests: Use label 0x800000 instead of 0x000000 for BMP tests
Related-to: 94e2aadf7187d7d695babce21033b5bc8e454f25 ("bgpd: Set the label for MP_UNREACH_NLRI 0x800000 instead of 0x000000") Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
Martin Buck [Tue, 25 Mar 2025 15:53:12 +0000 (16:53 +0100)]
tests: Fix wait times in test_ospf6_gr_topo1 topotest
Increase wait times to at least the minimum wait time accepted by
topotest.run_and_expect(). Also change poll interval to 1s, no point in
doings this more frequently.
Finally, slightly improve the topology diagram to also include area numbers.
Signed-off-by: Martin Buck <mb-tmp-tvguho.pbz@gromit.dyndns.org>
Donatas Abraitis [Wed, 26 Mar 2025 08:30:52 +0000 (10:30 +0200)]
bgpd: Set the label for MP_UNREACH_NLRI 0x800000 instead of 0x000000
RFC8277 says:
The procedures in [RFC3107] for withdrawing the binding of a label
or sequence of labels to a prefix are not specified clearly and correctly.
=> How to Explicitly Withdraw the Binding of a Label to a Prefix
Suppose a BGP speaker has announced, on a given BGP session, the
binding of a given label or sequence of labels to a given prefix.
Suppose it now wishes to withdraw that binding. To do so, it may
send a BGP UPDATE message with an MP_UNREACH_NLRI attribute. The
NLRI field of this attribute is encoded as follows:
Upon transmission, the Compatibility field SHOULD be set to 0x800000.
Upon reception, the value of the Compatibility field MUST be ignored.
[RFC3107] also made it possible to withdraw a binding without
specifying the label explicitly, by setting the Compatibility field
to 0x800000. However, some implementations set it to 0x000000. In
order to ensure backwards compatibility, it is RECOMMENDED by this
document that the Compatibility field be set to 0x800000, but it is
REQUIRED that it be ignored upon reception.
In FRR case where a single label is used per-prefix, we should send 0x800000,
and not 0x000000.
Donatas Abraitis [Tue, 25 Mar 2025 15:35:41 +0000 (17:35 +0200)]
tests: Check if routes are marked as stale and retained with N-bit for GR
Related-to: b7c657d4e065f310fcf6454abae1a963c208c3b8 ("bgpd: Retain the routes if we do a clear with N-bit set for Graceful-Restart") Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
Shbinging [Fri, 21 Mar 2025 02:57:28 +0000 (02:57 +0000)]
babeld: fix hello packets not sent with configured hello timer
Same issue occurring as previously addressed in https://github.com/FRRouting/frr/pull/9092. The root cause is: "Sending a Hello message before restarting the hello timer to avoid session flaps in case of larger hello interval configurations."
Mark Stapp [Mon, 24 Mar 2025 20:53:32 +0000 (16:53 -0400)]
bgpd: fix handling of configured RTs for l2vni, l3vni
Test for existing explicit config as part of validation of
route-target configuration: allow explicit config of generic/
default AS+VNI, for example, instead of rejecting it.
Modified the bgp_fsm code to dissallow the extension
of the hold time when the system is under extremely
heavy load. This was a attempt to remove the return
code but it was too aggressive and messed up this bit
of code.
zmw12306 [Mon, 24 Mar 2025 19:32:18 +0000 (15:32 -0400)]
babeld: Hop Count must not be 0.
According to RFC 8966:
Hop Count The maximum number of times that this TLV may be forwarded, plus 1. This MUST NOT be 0. Signed-off-by: zmw12306 <zmw12306@gmail.com>
Donald Sharp [Mon, 24 Mar 2025 12:07:02 +0000 (08:07 -0400)]
eigrpd: Cleanup memory issues on shutdown
a) EIGRP was having issues with the prefix created as part
of the topology destination. Make this just a part of the
topology data structure instead of allocating it.
b) EIGRP was not freeing up any memory associated with
the network table. Free it.
c) EIGRP was confusing zebra shutdown as part of the deletion
of the last eigrp data structure. This was inappropriate it
should be part of the `I'm just shutting down`.
d) The QOBJ was not being properly freed, free it.
Donald Sharp [Mon, 24 Mar 2025 01:16:56 +0000 (21:16 -0400)]
eigrpd: Convert the eiflist to a typesafe hash
The eigrp->eiflist is a linked list and should just
be a hash instead. The full conversion to a hash
like functionality is goingto wait until the connected
eigrp data structure is created.
Christian Hopps [Mon, 24 Mar 2025 05:07:28 +0000 (05:07 +0000)]
tests: add another directory to search path for pylint
Some IDEs (e.g., emacs+lsp) run pylint from the root directory and so
we need to add `tests/topotests` so that `lib` and `munet` are found
by pylint when used in imports
Donald Sharp [Sun, 23 Mar 2025 21:48:02 +0000 (17:48 -0400)]
tests: high_ecmp creates 2 update groups
The high_ecmp test was creating 2 update groups, where
513 of the neighbors are in 1 and the remaining is in
another. They should just all be in 1 update group.
Modify the test creation such that interfaces r1-eth514
and r2-eth514 have v4 and v6 addresses.
Signed-off-by: Donald Sharp <donaldsharp72@gmail.com>
staticd: Fix crash that occurs when modifying an SRv6 SID
When the user modifies an SRv6 SID and then removes all SIDs, staticd
crashes:
```
2025/03/23 08:37:22.691860 STATIC: lib/memory.c:74: mt_count_free(): assertion (mt->n_alloc) failed
STATIC: Received signal 6 at 1742715442 (si_addr 0x8200007cf0); aborting...
STATIC: zlog_signal+0x390 fcc704a844b8ffffd7450390 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: core_handler+0x1f8 fcc704b79990ffffd7450590 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: ---- signal ----
STATIC: ? fcc705c008f8ffffd74507a0 linux-vdso.so.1 (mapped at 0xfcc705c00000)
STATIC: pthread_key_delete+0x1a0 fcc70458f1f0ffffd7451a00 /lib/aarch64-linux-gnu/libc.so.6 (mapped at 0xfcc704510000)
STATIC: raise+0x1c fcc70454a67cffffd7451ad0 /lib/aarch64-linux-gnu/libc.so.6 (mapped at 0xfcc704510000)
STATIC: abort+0xe4 fcc704537130ffffd7451af0 /lib/aarch64-linux-gnu/libc.so.6 (mapped at 0xfcc704510000)
STATIC: _zlog_assert_failed+0x3c4 fcc704c407c8ffffd7451c40 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: mt_count_free+0x12c fcc704a93c74ffffd7451dc0 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: qfree+0x28 fcc704a93fa0ffffd7451e70 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: static_srv6_sid_free+0x1c adc1df8fa544ffffd7451e90 /usr/lib/frr/staticd (mapped at 0xadc1df8a0000)
STATIC: delete_static_srv6_sid+0x14 adc1df8faafcffffd7451eb0 /usr/lib/frr/staticd (mapped at 0xadc1df8a0000)
STATIC: list_delete_all_node+0x104 fcc704a60eecffffd7451ed0 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: list_delete+0x8c fcc704a61054ffffd7451f00 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: static_srv6_cleanup+0x20 adc1df8fabdcffffd7451f20 /usr/lib/frr/staticd (mapped at 0xadc1df8a0000)
STATIC: sigint+0x40 adc1df8be544ffffd7451f30 /usr/lib/frr/staticd (mapped at 0xadc1df8a0000)
STATIC: frr_sigevent_process+0x148 fcc704b79460ffffd7451f40 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: event_fetch+0x1c4 fcc704bc0834ffffd7451f60 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: frr_run+0x650 fcc704a5d230ffffd7452080 /usr/lib/frr/libfrr.so.0 (mapped at 0xfcc704800000)
STATIC: main+0x1d0 adc1df8be75cffffd7452270 /usr/lib/frr/staticd (mapped at 0xadc1df8a0000)
STATIC: __libc_init_first+0x7c fcc7045373fcffffd74522b0 /lib/aarch64-linux-gnu/libc.so.6 (mapped at 0xfcc704510000)
STATIC: __libc_start_main+0x98 fcc7045374ccffffd74523c0 /lib/aarch64-linux-gnu/libc.so.6 (mapped at 0xfcc704510000)
STATIC: _start+0x30 adc1df8be0f0ffffd7452420 /usr/lib/frr/staticd (mapped at 0xadc1df8a0000)
```
Tracking this down, the crash occurs because every time we modify a
SID, staticd executes some callbacks to modify the SID and finally it
calls `apply_finish`, which re-adds the SID to the list `srv6_sids`.
This leads to having the same SID multiple times in the `srv6_sids`
list. When we delete all SIDs, staticd attempts to deallocate the same
SID multiple times, which leads to the crash.
This commit fixes the issue by moving the code that adds the SID to the
list from the `apply_finish` callback to the `create` callback.
This ensures that the SID is inserted into the list only once, when it
is created. For all subsequent modifications, the SID is modified but
not added to the list.
Donald Sharp [Fri, 21 Mar 2025 22:08:25 +0000 (18:08 -0400)]
tests: Change up start order of bmp tests
Currently the tests appear to do this:
a) Start the neighbors
b) Start the bmp server connection
c) Look for the neighbors up
d) Look for the neighbor up messages in the bmp log
This is not great from a testing perspective in that
even though we started a) first it may not happen
until after b) happens. Or even worse if it is
partially up ( 1 of the 2 peers ) then the dump
will have the neighbor connecting after parts
of the table. This doesn't work too well because
the SEQ number is something that is kept and compared
to to make sure only new data is being looked at.
Let's modify the startup configuration to start
the bmp server first and then have a delayopen
on the bgp neighbor statements so that the bmp
peering can come up first.
Soumya Roy [Fri, 14 Mar 2025 22:01:51 +0000 (22:01 +0000)]
zebra: reduce memory usage by streams when redistributing routes
This commit undo 8c9b007a0c7efb2e9afc2eac936ba9dd971c6707
stream lib has been modified to expand the stream if needed
Now for zapi route encode, we use expandable stream
Soumya Roy [Fri, 14 Mar 2025 21:56:48 +0000 (21:56 +0000)]
zebra: zebra crash for zapi stream
Issue:
If static route is created with a BGP route as nexthop, which
recursively resolves over 512 ECMP v6 nexthops, zapi nexthop encode
fails, as there is not enough memory allocated for stream. This causes
assert/core dump in zebra. Right now we allocate fixed memory
of ZEBRA_MAX_PACKET_SIZ size.
Fix:
1)Dynamically calculate required memory size for the stream
2)try to optimize memory usage
Testing:
No crash happens anymore with the fix
zebra: zebra crash for zapi stream
Issue:
If static route is created with a BGP route as nexthop, which
recursively resolves over 512 ECMP v6 nexthops, zapi nexthop encode
fails, as there is not enough memory allocated for stream. This causes
assert/core dump in zebra. Right now we allocate fixed memory
of ZEBRA_MAX_PACKET_SIZ size.
Fix:
1)Dynamically calculate required memory size for the stream
2)try to optimize memory usage
Testing:
No crash happens anymore with the fix
r1#
r1# sharp install routes 2100:cafe:: nexthop 2001:db8::1 1000
r1#
Soumya Roy [Fri, 14 Mar 2025 21:44:39 +0000 (21:44 +0000)]
lib: Add support for stream buffer to expand
Issue:
Currently, during encode time, if required memory is
more than available space in stream buffer, stream buffer
can't be expanded. This fix introduces new apis to support
stream buffer expansion.
Testing:
Tested with zebra nexthop encoding with 512 nexthops, which triggers
this new code changes, it works fine. Without fix, for same trigger
it asserts.
Donald Sharp [Wed, 12 Mar 2025 16:37:21 +0000 (12:37 -0400)]
bgpd: Tie in more clear events to clear code
The `clear bgp *` and the interface down events
cause a global clearing of data from the bgp rib.
Let's tie those into the clear peer code such
that we can take advantage of the reduced load
in these cases too.
Mark Stapp [Wed, 12 Mar 2025 13:55:53 +0000 (09:55 -0400)]
bgpd: Allow batch clear to do partial work and continue later
Modify the batch clear code to be able to stop after processing
some of the work and to pick back up again. This will allow
the very expensive nature of the batch clearing to be spread out
and allow bgp to continue to be responsive.
Tuetuopay [Mon, 17 Mar 2025 14:08:15 +0000 (15:08 +0100)]
bgpd: fix evpn attributes being dropped on input
All assignments of the EVPN attributes (ESI and Gateway IP) are gated
behind the peer being set up for inbound soft-reconfiguration.
There are no actual reasons for this limitation, so let's perform the
EVPN attribute assignment no matter what when soft reconfiguration is
not enabled.
Fixes: 6e076ba5231 ("bgpd: Fix for ain->attr corruption during path update") Signed-off-by: Tuetuopay <tuetuopay@me.com>
Donald Sharp [Wed, 19 Mar 2025 20:50:11 +0000 (16:50 -0400)]
bgpd: Fix leaked memory when showing some bgp routes
The two memory leaks:
==387155== 744 (48 direct, 696 indirect) bytes in 1 blocks are definitely lost in loss record 222 of 262
==387155== at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==387155== by 0x4C1B982: json_object_new_object (in /usr/lib/x86_64-linux-gnu/libjson-c.so.5.1.0)
==387155== by 0x2E4146: peer_adj_routes (bgp_route.c:15245)
==387155== by 0x2E4F1A: show_ip_bgp_instance_neighbor_advertised_route_magic (bgp_route.c:15549)
==387155== by 0x2B982B: show_ip_bgp_instance_neighbor_advertised_route (bgp_route_clippy.c:722)
==387155== by 0x4915E6F: cmd_execute_command_real (command.c:1003)
==387155== by 0x4915FE8: cmd_execute_command (command.c:1062)
==387155== by 0x4916598: cmd_execute (command.c:1228)
==387155== by 0x49EB858: vty_command (vty.c:626)
==387155== by 0x49ED77C: vty_execute (vty.c:1389)
==387155== by 0x49EFFA7: vtysh_read (vty.c:2408)
==387155== by 0x49E4156: event_call (event.c:2019)
==387155== by 0x4958ABD: frr_run (libfrr.c:1247)
==387155== by 0x206A68: main (bgp_main.c:557)
==387155==
==387155== 2,976 (192 direct, 2,784 indirect) bytes in 4 blocks are definitely lost in loss record 240 of 262
==387155== at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==387155== by 0x4C1B982: json_object_new_object (in /usr/lib/x86_64-linux-gnu/libjson-c.so.5.1.0)
==387155== by 0x2E45CA: peer_adj_routes (bgp_route.c:15325)
==387155== by 0x2E4F1A: show_ip_bgp_instance_neighbor_advertised_route_magic (bgp_route.c:15549)
==387155== by 0x2B982B: show_ip_bgp_instance_neighbor_advertised_route (bgp_route_clippy.c:722)
==387155== by 0x4915E6F: cmd_execute_command_real (command.c:1003)
==387155== by 0x4915FE8: cmd_execute_command (command.c:1062)
==387155== by 0x4916598: cmd_execute (command.c:1228)
==387155== by 0x49EB858: vty_command (vty.c:626)
==387155== by 0x49ED77C: vty_execute (vty.c:1389)
==387155== by 0x49EFFA7: vtysh_read (vty.c:2408)
==387155== by 0x49E4156: event_call (event.c:2019)
==387155== by 0x4958ABD: frr_run (libfrr.c:1247)
==387155== by 0x206A68: main (bgp_main.c:557)
For the 1st one, if the operator issues a advertised-routes command, the
json_ar variable was never being freed.
For the 2nd one, if the operator issued a command where the
output_count_per_rd is 0, we need to free the json_routes value.
Donald Sharp [Wed, 19 Mar 2025 19:20:31 +0000 (15:20 -0400)]
tests: Ensure that the daemon has connected to zebra
On daemon startup, ensure that the daemon is there and
connected to zebra. There are some exceptions,
pathd is srte. pim6d and pimd are the same at the
moment and finally smnptrapd.
This should help the startup of using a unified
config in the topotests.
Donald Sharp [Wed, 19 Mar 2025 16:22:04 +0000 (12:22 -0400)]
zebra: Allow fpm_listener to reject all routes
Now usage of `-r -f` with fpm_listener now causes all
routes to be rejected.
r1# sharp install routes 10.0.0.0 nexthop 192.168.44.5 5
r1# show ip route
Codes: K - kernel route, C - connected, L - local, S - static,
R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric, t - Table-Direct,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
IPv4 unicast VRF default:
D>o 10.0.0.0/32 [150/0] via 192.168.44.5, r1-eth0, weight 1, 00:00:02
D>o 10.0.0.1/32 [150/0] via 192.168.44.5, r1-eth0, weight 1, 00:00:02
D>o 10.0.0.2/32 [150/0] via 192.168.44.5, r1-eth0, weight 1, 00:00:02
D>o 10.0.0.3/32 [150/0] via 192.168.44.5, r1-eth0, weight 1, 00:00:02
D>o 10.0.0.4/32 [150/0] via 192.168.44.5, r1-eth0, weight 1, 00:00:02
C>* 192.168.44.0/24 is directly connected, r1-eth0, weight 1, 00:00:37
L>* 192.168.44.1/32 is directly connected, r1-eth0, weight 1, 00:00:37
r1#
Nathan Bahr [Wed, 19 Mar 2025 16:07:37 +0000 (16:07 +0000)]
lib: Create VRF if needed
When creating a control plane protocol through NB, create the vrf
if needed instead of only looking up and asserting if it doesn't
exist yet.
Fixes 18429.