Stephen Worley [Fri, 21 Oct 2022 16:45:50 +0000 (12:45 -0400)]
bgpd,doc: limit InQ buf to allow for back pressure
Add a default limit to the InQ for messages off the bgp peer
socket. Make the limit configurable via cli.
Adding in this limit causes the messages to be retained in the tcp
socket and allow for tcp back pressure and congestion control to kick
in.
Before this change, we allow the InQ to grow indefinitely just taking
messages off the socket and adding them to the fifo queue, never letting
the kernel know we need to slow down. We were seeing under high loads of
messages and large perf-heavy routemaps (regex matching) this queue
would cause a memory spike and BGP would get OOM killed. Modifying this
leaves the messages in the socket and distributes that load where it
should be in the socket buffers on both send/recv while we handle the
mesages.
Also, changes were made to allow the ringbuffer to hold messages and
continue to be filled by the IO pthread while we wait for the Main
pthread to handle the work on the InQ.
Memory spike seen with large numbers of routes flapping and route-maps
with dozens of regex matching:
```
Memory statistics for bgpd:
System allocator statistics:
Total heap allocated: > 2GB
Holding block headers: 516 KiB
Used small blocks: 0 bytes
Used ordinary blocks: 160 MiB
Free small blocks: 3680 bytes
Free ordinary blocks: > 2GB
Ordinary blocks: 121244
Small blocks: 83
Holding blocks: 1
```
With most of it being held by the inQ (seen from the stream datastructure info here):
```
Type : Current# Size Total Max# MaxBytes
...
...
Stream : 115543 variable 26963208159707403571708768
```
With this change that memory is capped and load is left in the sockets:
RECV Side:
```
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 265350 0 [fe80::4080:30ff:feb0:cee3]%veth1:36950 [fe80::4c14:9cff:fe1d:5bfd]:179 users:(("bgpd",pid=1393334,fd=26))
skmem:(r403688,rb425984,t0,tb425984,f1816,w0,o0,bl0,d61)
```
SEND Side:
```
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 1275012 [fe80::4c14:9cff:fe1d:5bfd]%veth1:179 [fe80::4080:30ff:feb0:cee3]:36950 users:(("bgpd",pid=1393443,fd=27))
skmem:(r0,rb131072,t0,tb1453568,f1916,w1300612,o0,bl0,d0)
```
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Stephen Worley [Fri, 21 Oct 2022 15:18:12 +0000 (11:18 -0400)]
bgpd: fix vni_str NULL check in evpn rt show run
Fix the vni_str NULL check for wildcard route-targets
in evpn show run. This will never be NULL if we add 1
here. Though it should also never be NULL since ":" should
always exist. Better to be safe than sorry.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Donald Sharp [Wed, 19 Oct 2022 16:57:28 +0000 (12:57 -0400)]
lib: Remove unnecessary comparison, for linked list
In the comparison function for a linked list code was
always checking against passed in NULL's. The comparison
function will never receive a NULL value for data from
the linklist.c code.
Donald Sharp [Wed, 19 Oct 2022 16:44:55 +0000 (12:44 -0400)]
zebra: Fix debug of filtering out prefix due to routemap
The debug for notification about a filtered prefix was
just printing the nexthop ifindex and vrf id. Not all
nexthops have this data. Just print out the actual nexthop
Sarita Patra [Tue, 11 Oct 2022 01:38:14 +0000 (18:38 -0700)]
pimd, pim6d: Don't configure link-local, Multicast, Unspecified address as RP
Problem:
=======
frr(config)# do show ipv6 pim interface
Interface State Address PIM Nbrs PIM DR FHR IfChannels
ens192 up fe80::250:56ff:feb7:3619 0 local 0 1
Configure ens192 interface link-local address as RP.
frr(config)# ipv6 pim rp fe80::250:56ff:feb7:3619
No Path to RP address specified: fe80::250:56ff:feb7:3619
frr(config)# do show ipv6 pim rp-info
RP address group/prefix-list OIF I am RP Source Group-Type
fe80::250:56ff:feb7:3619 ff00::/8 Unknown yes Static ASM
Fix:
===
RP should not be link-local, multicast and unspecified address.
Introduced common api pim_process_unicast_bsm_cmd,
pim_process_no_unicast_bsm_cmd which will process
both "[no] ip pim unicast-bsm" command and "[no] ipv6 pim
unicast-bsm" command.
Currently, the SID transposition algorithm implemented in bgpd handles
incorrectly the SRv6 locators with function length greater than 20 bits.
To prevent issues, we currently limit the function length to 20 bits.
This limit will be removed when the bgpd SID transposition is fixed.
According to RFC 8986, the SRv6 SID length cannot exceed 128 bits. This
commit ensures that the condition
`block_len + node_len + function_len + arg_len <= 128` is satisfied when
a new SRv6 locator is created.
This commit extends the `bgp_srv6l3vpn_to_bgp_vrf3` topotest by adding
two tests:
* prevent bgpd from exporting routes from a VRF to the VPN RIB
(`no sid vpn per-vrf export`);
* enable bgpd to export routes from a VRF to the VPN RIB
(`sid vpn per-vrf export auto`).
The command `sid vpn per-vrf export (1-255)|auto` can be used to export
IPv4 and IPv6 routes from a VRF to the VPN RIB using a single SRv6 SID
(End.DT46 behavior).
This commit implements the no form of the above command, which can be
used to disable the export of the IPv4/IPv6 routes:
`no sid vpn per-vrf export`.
This commit adds a new test case to the
test_zebra_seg6local_route topotest. The new test case performs two
operations:
* try to install a seg6local route with an End.DT46 action
* verify that the route is created correctly
In the current implementation of bgpd, SRv6 SIDs can be configured only
under the address-family. This enables bgpd to leak IPv6 routes using
an SRv6 End.DT6 behavior and IPv4 routes using an SRv6 End.DT4
behavior. It is not possible to leak both IPv6 and IPv4 routes using a
single SRv6 SID.
This commit adds a new CLI command
"sid vpn per-vrf export <sid_idx|auto>" that enables bgpd to leak both
IPv6 and IPv4 routes using a single SRv6 SID (End.DT46 behavior).
This commit adds the SRv6 locator's block length, node length and
argument length to the output of the command
"show segment-routing srv6 locator json"
zebra: add missing bits len to SRv6 locator detail
This commit adds SRv6 locator's block length, node length and argument
length to the output of the command
"show segment-routing srv6 locator NAME detail [json]".
zebra: add new CLI args "block-len" and "node-len"
In the current implementation, an SRv6 locator only supports the
following structure:
* node-len = 24
* block-len = prefix-len - 24
* function-len = <configurable>
* argument-len = 0
This commit adds two optional arguments to the locator_prefix CLI
command: "node-len" and "block-len". These arguments allows an user to
configure the block length and node length of a SRv6 locator according
to the following logic:
* the node-len + block-len = prefix-len constraint must always be
satisfied;
* if node-len and block-len are both omitted, they are calculated as in
the current implementation (for backward compatibility reasons)
* if node-len is omitted, its value is computed as
prefix-len - block-len
* if block-len is omitted, its value is computed as
prefix-len - node-len
Donald Sharp [Wed, 12 Oct 2022 20:05:23 +0000 (16:05 -0400)]
ospfd: Allow unnumbered and numbered addresses to co-exist better
When forming a neighbor relationship on an interface, ospf is
currently evaluating unnumbered as highest priority, without
any consideration for if you have /32's and non /32's on the
interface. Effectively if I have something like this:
int foo0
ip address 192.168.119.1/24
!
router ospf
network 0.0.0.0/0 area 0
!
ospf will form a neighbor on foo0 if it exists. Now
suppose someone does this:
int foo0
ip address 192.168.120.1/32
This will create the unnumbered interface on foo0 and
the peering will come down immediately.
The problem here is that the original designers of the unnumbered
code for ospf didn't envision end operators mixing and matching
addresses on an interface like this ( for perfectly legitimate
reasons I might add ).
So if ospf has both numbered and unnumbered let's match against
the numbered first and then unnumbered. This solves the problem
Fixes: #6823 Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Donald Sharp [Wed, 12 Oct 2022 18:53:21 +0000 (14:53 -0400)]
bgpd: Allow `network XXX` to work with bgp suppress-fib-pending
When bgp is using `bgp suppress-fib-pending` and the end
operator is using network statements, bgp was not sending
the network'ed prefix'es to it's peers. Fix this.
Also update the test cases for bgp_suppress_fib to test
this new corner case( I am sure that there are going to
be others that will need to be added ).
Fixes: #12112 Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Donald Sharp [Tue, 11 Oct 2022 17:21:03 +0000 (13:21 -0400)]
lib: Free some memory in scripting subsystem at shutdown
Pre:
staticd: showing active allocations in memory group libfrr
staticd: memstats: Scripting : 16 * (variably sized)
staticd: memstats: Hash : 2 * (variably sized)
staticd: memstats: Hash Bucket : 8 * 32
staticd: memstats: Hash Index : 1 * (variably sized)
staticd: memstats: Link List : 1 * 40
staticd: memstats: Link Node : 1 * 24
staticd: showing active allocations in memory group logging subsystem
staticd: memstats: log file target : 1 * 88
staticd: showing active allocations in memory group staticd
Post:
staticd: showing active allocations in memory group libfrr
staticd: showing active allocations in memory group logging subsystem
staticd: memstats: log file target : 1 * 88
staticd: showing active allocations in memory group staticd