Ryoga Saito [Thu, 16 Dec 2021 14:09:55 +0000 (23:09 +0900)]
bgpd: delete NULL assignment in bgp_attr_hash_alloc
If soft-reconfiguration is enabled, bgp_adj_in_set will be called
from bgp_update and bgp_adj_in_set will call bgp_attr_intern to intern
attr pointer. If given attr isn't found in attrhash, hash_get will call
bgp_attr_hash_alloc to allocate new attr structure. In
bgp_attr_hash_alloc, NULL will be assigned to the srv6_vpn and
srv6_l3vpn fields of the original attr pointer. attr->srv6_vpn and
attr->srv6_l3vpn are interned in bgp_attr_intern, so NULL assignment
isn't needed.
In addition, these fields are used later in bgp_update to set SRv6
information in bgp_path_info. If bgp_attr_hash_alloc assigns NULL to
these fields, the SRv6 information is lost and incorrect routes are
installed in the data plane.
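The effect can be illustrated with a toy model in Python. This is only a sketch, not bgpd code: `Attr`, `intern`, `alloc_buggy` and `alloc_fixed` are hypothetical stand-ins for the attr structure, bgp_attr_intern and the two variants of bgp_attr_hash_alloc.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Attr:
    """Toy stand-in for the attr structure (hypothetical layout)."""
    med: int
    srv6_l3vpn: Optional[str] = None


def alloc_buggy(orig: Attr) -> Attr:
    """Copy the caller's attr, then clear the field on the ORIGINAL (the bug)."""
    new = replace(orig)
    orig.srv6_l3vpn = None
    return new


def alloc_fixed(orig: Attr) -> Attr:
    """Copy the caller's attr and leave the original untouched."""
    return replace(orig)


def intern(table: Dict[Tuple[int, Optional[str]], Attr],
           attr: Attr,
           alloc: Callable[[Attr], Attr]) -> Attr:
    """Return the shared copy of attr, allocating it on first use."""
    key = (attr.med, attr.srv6_l3vpn)
    if key not in table:
        table[key] = alloc(attr)
    return table[key]
```

With `alloc_buggy`, the caller's `attr.srv6_l3vpn` is None after `intern` returns, so any later code reading it from the original pointer sees no SRv6 information. With `alloc_fixed`, the field survives.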
Donald Sharp [Sat, 11 Dec 2021 17:05:36 +0000 (12:05 -0500)]
tests: test_ospf_lan.py is looking for a certain order enforce it
When converging, OSPF chooses a DR / Backup DR based upon
who has already come up, irrespective of priority. As such, if
under system load OSPF comes up first and elects a DR that under
normal circumstances would not be the elected one due to priority,
OSPF does not go back through and re-elect, in order to keep the
system stable. Tests are experiencing this:
unet> r0 show ip ospf neigh
Neighbor ID Pri State Up Time Dead Time Address Interface RXmtL RqstL DBsmL
100.1.1.1 99 Full/Backup 4m14s 3.780s 10.0.1.2 r0-s1-eth0:10.0.1.1 0 0 0
100.1.1.2 0 Full/DROther 4m14s 3.848s 10.0.1.3 r0-s1-eth0:10.0.1.1 0 0 0
100.1.1.3 0 Full/DROther 4m14s 3.912s 10.0.1.4 r0-s1-eth0:10.0.1.1 0 0 0
unet> r1 show ip ospf neigh
Neighbor ID Pri State Up Time Dead Time Address Interface RXmtL RqstL DBsmL
100.1.1.0 98 Full/DR 4m15s 3.011s 10.0.1.1 r1-s1-eth1:10.0.1.2 0 0 0
100.1.1.2 0 Full/DROther 4m19s 3.124s 10.0.1.3 r1-s1-eth1:10.0.1.2 0 0 0
100.1.1.3 0 Full/DROther 4m19s 3.188s 10.0.1.4 r1-s1-eth1:10.0.1.2 0 0 0
unet> r2 show ip ospf neigh
Neighbor ID Pri State Up Time Dead Time Address Interface RXmtL RqstL DBsmL
100.1.1.0 98 Full/DR 4m27s 3.483s 10.0.1.1 r2-s1-eth0:10.0.1.3 0 0 0
100.1.1.1 99 Full/Backup 4m32s 3.527s 10.0.1.2 r2-s1-eth0:10.0.1.3 0 0 0
100.1.1.3 0 2-Way/DROther 4m32s 3.660s 10.0.1.4 r2-s1-eth0:10.0.1.3 0 0 0
unet> r3 show ip ospf neigh
Neighbor ID Pri State Up Time Dead Time Address Interface RXmtL RqstL DBsmL
100.1.1.0 98 Full/DR 4m55s 3.786s 10.0.1.1 r3-s1-eth1:10.0.1.4 0 0 0
100.1.1.1 99 Full/Backup 4m55s 3.829s 10.0.1.2 r3-s1-eth1:10.0.1.4 0 0 0
100.1.1.2 0 2-Way/DROther 4m54s 3.897s 10.0.1.3 r3-s1-eth1:10.0.1.4 0 0 0
Modify the test to do a clear to enforce the order we are specifically looking for.
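The non-preemptive behaviour can be sketched as follows. This is a deliberately simplified model of the election (it ignores Backup DR promotion and the neighbor state machine), and `Router` and `elect_dr` are hypothetical names, not ospfd code.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List, Optional


@dataclass
class Router:
    router_id: str
    priority: int
    declares_dr: bool = False  # True once this router already acts as DR


def elect_dr(routers: List[Router]) -> Optional[Router]:
    """Pick the DR: an existing DR wins; priority only breaks ties."""
    eligible = [r for r in routers if r.priority > 0]
    # Routers already declaring themselves DR are considered first, so a
    # late arrival with a higher priority does not displace them.
    declared = [r for r in eligible if r.declares_dr]
    pool = declared or eligible
    if not pool:
        return None
    return max(pool, key=lambda r: (r.priority, IPv4Address(r.router_id)))
```

In this model, a LAN where 100.1.1.0 (priority 98) came up first and already declares itself DR keeps it as DR even after 100.1.1.1 (priority 99) joins. Only once no router is claiming the role does priority decide, which is the state the clear is meant to get back to.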
Chirag Shah [Thu, 2 Dec 2021 06:13:37 +0000 (22:13 -0800)]
tools: exit when reload fails to parse config file
frr-reload triggers a restart of the service when
it fails to parse the new config file and the
running config contains 'router bgp' (default bgp instance).
When frr-reload fails to parse the new config file, it fails
to build the newconfig context (empty object).
Instead of bailing out, it compares against the running config
context. If the running config contains the default bgp instance,
it thinks the new config is removing the default bgp instance, so it
triggers an frr restart.
The fix is to bail out of the reload script when it fails to parse
the config file.
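A minimal sketch of the bail-out idea, assuming a toy parser. `parse_contexts`, `contexts_to_remove` and `ConfigParseError` are hypothetical names and do not reflect the actual frr-reload code; a line starting with '%' stands in for any parse error.

```python
from typing import List, Set


class ConfigParseError(Exception):
    """Raised when the candidate config cannot be parsed."""


def parse_contexts(text: str) -> Set[str]:
    """Toy parser: top-level lines are contexts, '%' lines are errors."""
    contexts = set()
    for line in text.splitlines():
        if line.startswith("%"):
            raise ConfigParseError(line)
        if line and not line[0].isspace():
            contexts.add(line.strip())
    return contexts


def contexts_to_remove(running: str, candidate: str) -> List[str]:
    """Contexts present in the running config but absent from the candidate."""
    # Parse the candidate first and let the error propagate. An empty
    # stand-in here would make every running context look "removed".
    new_contexts = parse_contexts(candidate)
    return sorted(parse_contexts(running) - new_contexts)
```

A caller would catch `ConfigParseError` and exit with a non-zero status instead of going on to compare against the running config.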
tools: add a script to generate draft release changelog
This utility script helps generate a formatted and consistent
changelog, including:
1- group logs per daemon
2- standardize daemon names (lowercase, end with d)
3- capitalize all log lines
4- no merge commits
caveat: comments are assumed to be in the form
daemon-name : message
Sample Output:
```
sharpd
Follow the practice on cli design for json output
Install route supports nexthop-seg6 (step3)
Install_routes_helper support zapi_route flags (step1)
snapcraft
Add missing dependency
Add pathd to frr snap daemons
Change base to ubuntu 18.04 and libyang 2.0.7
staticd
Convert typedef to enum
Fix distance processing
Fix late initialization of blackhole type
Output config using nb callbacks instead of operational data
```
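The four rules above could be sketched roughly as below. This is not the script from the commit: `KNOWN_DAEMONS`, `normalize_daemon` and `group_subjects` are hypothetical, the daemon set is an illustrative subset, and merge commits are assumed to be recognisable by a subject starting with "Merge ".

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Illustrative subset only; a real list would cover every daemon.
KNOWN_DAEMONS = {"bgpd", "bfdd", "ospfd", "sharpd", "staticd"}

# Matches subjects of the form "daemon-name : message".
SUBJECT_RE = re.compile(r"^\s*([^:\s]+)\s*:\s*(.+)$")


def normalize_daemon(name: str) -> str:
    """Lowercase the name and add the trailing 'd' for known daemons."""
    name = name.lower()
    if name + "d" in KNOWN_DAEMONS:
        return name + "d"
    return name


def group_subjects(subjects: Iterable[str]) -> Dict[str, List[str]]:
    """Group commit subjects per daemon, skipping merge commits."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for subject in subjects:
        if subject.startswith("Merge "):
            continue
        match = SUBJECT_RE.match(subject)
        if match is None:
            continue  # does not follow the "daemon-name : message" form
        daemon, message = match.groups()
        # Uppercase only the first character; keep identifiers intact.
        groups[normalize_daemon(daemon)].append(message[0].upper() + message[1:])
    return {daemon: sorted(lines) for daemon, lines in sorted(groups.items())}
```

Subjects that do not match the assumed form are silently dropped in this sketch, which is where the caveat about the subject format matters.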
Igor Ryzhov [Wed, 17 Nov 2021 23:20:43 +0000 (02:20 +0300)]
bfdd: remove unnecessary receive timer restart
When the detection time expires, we put the session down and restart the
timer. As the comment in the code says, it's needed to zero the remote
discriminator after the second expiration.
But the RFC clearly says that this must be done on the first expiration:
bfd.RemoteDiscr
The remote discriminator for this BFD session. This is the
discriminator chosen by the remote system, and is totally opaque
to the local system. This MUST be initialized to zero. If a
period of a Detection Time passes without the receipt of a valid,
authenticated BFD packet from the remote system, this variable
MUST be set to zero.
And we actually already do it in `ptm_bfd_sess_dn`, so there's no need
to reset the timer and wait for it twice.
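As a rough sketch of the resulting behaviour (a toy model, not the bfdd code; `Session` and `on_detection_timeout` are hypothetical names), a single expiration does all the work and does not re-arm the timer:

```python
from dataclasses import dataclass


@dataclass
class Session:
    state: str = "Up"
    remote_discr: int = 0
    local_diag: int = 0


def on_detection_timeout(session: Session) -> bool:
    """Handle Detection Time expiry; return True if the timer is re-armed."""
    session.state = "Down"
    session.local_diag = 1    # Control Detection Time Expired
    session.remote_discr = 0  # zeroed on the FIRST expiration
    return False              # nothing left to do on a second expiration
```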
Frr-reload failure:
line 179: Failure to communicate[13] to bgpd, line: neighbor 10.2.1.1
remote-as external
% Peer-group member cannot override remote-as of peer-group
line 179: Failure to communicate[13] to bgpd, line: neighbor 10.2.1.2
remote-as external
% Peer-group member cannot override remote-as of peer-group
Igor Ryzhov [Fri, 13 Aug 2021 23:09:54 +0000 (02:09 +0300)]
vtysh: fix duplicated output of key chain configuration
When both ripd and eigrpd run at the same time, all key configuration in
key chain node is duplicated. This change adds a concept of nested nodes
into vtysh to fix the issue.
Igor Ryzhov [Wed, 24 Nov 2021 12:01:41 +0000 (15:01 +0300)]
bfdd: fix detection timeout update
Per RFC 5880 section 6.8.12, the use of a Poll Sequence is not necessary
when the Detect Multiplier is changed. Currently, we update the Detection
Timeout only when a Poll Sequence is terminated, therefore we ignore the
Detect Multiplier change if it's not accompanied by an RX/TX timer change.
To fix the problem, we should update the Detection Timeout on every
received packet.
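A minimal sketch of recomputing the timeout per packet, assuming asynchronous mode, where the Detection Time is the remote Detect Mult multiplied by the greater of the local Required Min RX Interval and the last received Desired Min TX Interval. `BfdSession` and `on_packet` are hypothetical names, and Poll Sequence handling of interval changes is left out.

```python
from dataclasses import dataclass


@dataclass
class BfdSession:
    required_min_rx_us: int            # local bfd.RequiredMinRxInterval
    remote_desired_min_tx_us: int = 0  # last received Desired Min TX Interval
    remote_detect_mult: int = 0        # last received Detect Mult
    detection_time_us: int = 0


def on_packet(session: BfdSession, detect_mult: int,
              desired_min_tx_us: int) -> int:
    """Record the received values and recompute the Detection Time."""
    session.remote_detect_mult = detect_mult
    session.remote_desired_min_tx_us = desired_min_tx_us
    # Recomputed on every packet, so a Detect Mult change takes effect
    # even when no Poll Sequence accompanies it.
    session.detection_time_us = detect_mult * max(
        session.required_min_rx_us, desired_min_tx_us
    )
    return session.detection_time_us
```

For example, with a local Required Min RX of 300000 us, a peer going from Detect Mult 3 to 5 at an unchanged 300000 us TX interval moves the timeout from 900000 us to 1500000 us on the very next packet.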
Stephen Worley [Mon, 29 Nov 2021 19:59:06 +0000 (14:59 -0500)]
zebra: add optional NHG ID output to `show ip ro`
Add optional NHG ID output to `show ip route` dumps. We have
this in json output already as nexthopGroupID, but it is nice
to have the option in a normal dump as well. Not including it in the
main output for now to avoid breaking screen scrapers.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Donald Sharp [Tue, 30 Nov 2021 00:33:48 +0000 (19:33 -0500)]
tests: Fix Daemon Killing to actually notice when a deamon dies
Lots of the GR topotests kill daemons in order to test code
that deals with crashing daemons. Under heavy system load
it was noticed that a kill command was sent and, if told to
wait, we would sleep 2 seconds, send another kill command and
call it good. This was causing issues when subsequent
json commands would get errors like `lost connection to daemon`
as the daemon finally shut down after some time due to load.
Modify the kill-the-daemon function to notice that the daemon
was not actually killed and, if we need to wait, wait some
more time for it to happen.
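The idea can be sketched as below. This is not the topotest helper itself: `kill_and_wait` and `pid_alive` are hypothetical names, and the timeout and interval values are arbitrary.

```python
import os
import signal
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Probe the process with signal 0, which checks existence only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def kill_and_wait(pid: int,
                  is_alive: Callable[[int], bool] = pid_alive,
                  send_signal: Callable[[int, int], None] = os.kill,
                  timeout: float = 30.0,
                  interval: float = 0.5,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Send SIGTERM, then poll until the process is really gone."""
    send_signal(pid, signal.SIGTERM)
    waited = 0.0
    while is_alive(pid):
        if waited >= timeout:
            return False  # still running: report it instead of assuming
        sleep(interval)
        waited += interval
    return True
```

The probe, signal and sleep functions are parameters so the loop can be exercised without a real process. Note that an unreaped child still answers the signal 0 probe, so a caller that spawned the daemon itself would also need to reap it.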
Donald Sharp [Mon, 29 Nov 2021 20:51:45 +0000 (15:51 -0500)]
zebra: Prevent thread usage of data after it being freed
On startup we create a thread timer event to do a rib sweep
of the system. On shutdown we never stopped this timer and
as such we have a situation where a thread event could be run
on shutdown after the data for it has been freed. Here is the
crash I am seeing:
(gdb) bt
(gdb)
Save the thread data in zebra_router and stop the thread so we don't
accidentally do work on shutdown that we don't mean to. In this case
it happened in our topotests with some severe system load.
Essentially we happened to kill the zebra daemon just as the
graceful_restart timer popped here.
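The pattern, keeping a handle to the timer and stopping it before freeing the data it uses, can be sketched by analogy with Python's `threading.Timer`. zebra uses its own event loop rather than OS timers, so this is only an illustration, and `ToyRouter` is a hypothetical name.

```python
import threading


class ToyRouter:
    def __init__(self, sweep_delay: float) -> None:
        self.rib = {"10.0.0.0/24": "eth0"}
        self.swept = 0
        # Keep a handle to the timer so shutdown can stop it later.
        self._sweep_timer = threading.Timer(sweep_delay, self._sweep)
        self._sweep_timer.start()

    def _sweep(self) -> None:
        # Would raise TypeError if it ran after shutdown() dropped the rib.
        self.swept += len(self.rib)

    def shutdown(self) -> None:
        # Stop the pending timer BEFORE releasing the data it touches.
        self._sweep_timer.cancel()
        self._sweep_timer.join()
        self.rib = None
```

Without the `cancel()` call, a timer firing between shutdown and process exit would touch `rib` after it is gone, which is the Python analogue of the use-after-free described above.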
Donald Sharp [Mon, 29 Nov 2021 17:11:43 +0000 (12:11 -0500)]
tests: Allow interface statistics to be gathered with some delay
Currently, under system load, tests that use verify_pim_interface_traffic
immediately after an interface down/up event do not give pim any time
to receive and process the data from that event. Give
the test some time to gather this data.
Donald Sharp [Mon, 29 Nov 2021 13:37:21 +0000 (08:37 -0500)]
test: Fix addKernelRoute looking for positive results
Under heavy system load, we are sometimes seeing this
output for addKernelRoute:
2021-11-28 16:17:27,604 INFO: topolog: [DUT: b1]: Running command: [ip route add 224.0.0.13 dev b1-f1-eth0]
2021-11-28 16:17:27,604 DEBUG: topolog.b1: LinuxNamespace(b1): cmd_status("['/bin/bash', '-c', 'ip route add 224.0.0.13 dev b1-f1-eth0']", kwargs: {'encoding': 'utf-8', 'stdout': -1, 'stderr': -2, 'shell': False, 'stdin': None})
2021-11-28 16:17:27,967 DEBUG: topolog.b1: LinuxNamespace(b1): cmd_status("['/bin/bash', '-c', 'ip route']", kwargs: {'encoding': 'utf-8', 'stdout': -1, 'stderr': -2, 'shell': False, 'stdin': None})
2021-11-28 16:17:28,243 DEBUG: topolog: ip route
70.0.0.0/24 dev b1-f1-eth0 proto kernel scope link src 70.0.0.1
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This tells us that the ip route add succeeded but when looking for it
the system failed to immediately find it. Why is this happening?
Probably we are under heavy system load and the two different
commands, 'ip route add..' and 'ip route show', are being executed
on different CPUs and the data has not yet been copied to the other
CPU in the kernel. This is not necessarily something normally
seen but entirely possible. Giving the system a few extra seconds
for the kernel to execute/work the memory barrier system seems
prudent for long term success of our programming.
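A minimal sketch of such a retry, assuming the check simply looks for the prefix as the first token of an `ip route` output line. `wait_for` and `route_present` are hypothetical helpers, not the topotest library functions, and the retry count and interval are arbitrary.

```python
import time
from typing import Callable


def route_present(ip_route_output: str, prefix: str) -> bool:
    """True if some line of the output starts with the given prefix."""
    return any(
        line.split()[:1] == [prefix]
        for line in ip_route_output.splitlines()
    )


def wait_for(check: Callable[[], bool],
             retries: int = 5,
             interval: float = 1.0,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Re-run check until it passes or the retries are used up."""
    for attempt in range(retries):
        if check():
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False
```

A test would pass a `check` that runs `ip route` again on every attempt, so a route that shows up a moment after the add still counts as success.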
Donald Sharp [Sat, 27 Nov 2021 18:12:50 +0000 (13:12 -0500)]
tests: Fix isis_topo1_vrf to wait a tiny bit for zebra route install
During repeated runs I am seeing this test fail to run successfully.
Upon inspecting the output:
{
"prefix":"10.0.10.0/24",
"prefixLen":24,
"protocol":"isis",
"vrfId":6,
"vrfName":"r1-cust1",
"selected":true,
"destSelected":true,
"distance":115,
"metric":10,
"queued":true,
We can see that the route is still queued. Under heavy system
load, unless we ensure that isis has time to send the route to
zebra and zebra has time to install it, this test can fail.
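One way to express "installed and no longer queued" as a check over the JSON is sketched below. It assumes the output is an object keyed by prefix whose value is a list of route entries, and that the "queued" key is present only while the route is queued. `route_settled` is a hypothetical helper, not part of the test library.

```python
import json


def route_settled(show_route_json: str, prefix: str) -> bool:
    """True once every entry for prefix is selected and not queued."""
    data = json.loads(show_route_json)
    entries = data.get(prefix, [])
    return bool(entries) and all(
        entry.get("selected") and not entry.get("queued")
        for entry in entries
    )
```

Polling this predicate for a short time before comparing against the expected output gives zebra a chance to finish the install.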
Igor Ryzhov [Thu, 25 Nov 2021 18:17:58 +0000 (21:17 +0300)]
ospfd: fix summary-address deletion
When the summary-address is deleted, `ospf_aggr_handle_external_info` is
called for each aggregated route for the cleanup. It needs to find the
corresponding OSPF instance and it does it using the `ei->instance`
which is totally wrong, because it's the instance from which the route
is redistributed, not the local OSPF instance. A pointer to the correct
OSPF instance is already stored in the external_info structure.