David Lamparter [Thu, 14 Oct 2021 17:11:02 +0000 (19:11 +0200)]
doc/developer: fix warnings in topotests doc
Sphinx warns about a few nits here, just fix. (Note :option:`-E` can't
be used without a "option:: -E" definition, it's intended as a cross
reference.)
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Donald Sharp [Wed, 13 Oct 2021 18:12:51 +0000 (14:12 -0400)]
tests: BFD timing tests under system load need more leeway
We have this pattern in this test:
# Let's kill the interface on rt2 and see what happens with the RIB and BFD on rt1
tgen.gears["rt2"].link_enable("eth-rt1", enabled=False)
# By default BFD provides a recovery time of 900ms plus jitter, so let's wait
# initial 2 seconds to let the CI not suffer.
topotest.sleep(2, 'Wait for BFD down notification')
Under a heavy CI load, interface down events and then reacting to them may not actually
happen within 2 seconds. Allow some more grace time in the test to ensure that we
react to it in an appropriate manner.
Donald Sharp [Wed, 13 Oct 2021 16:46:22 +0000 (12:46 -0400)]
tests: Convert over to using converged to test for ospf being converged
OSPF when it is deciding on whom it should elect for DR and backup
has a process that prioritizes network stabilty over the exact
same results of who is the DR / Backups.
Essentially if we have r1 ----- r2
Let's say r1 has a higher priority, but r2 comes up first, starts
sending hello packets and then decides that it is the DR. At some
point in time in the future, r1 comes up and then connects to r2
at that point it sees that r2 has elected itself DR and it keeps
it that way.
This is by design of the system. With our tight ospf timers as
well as high load being experienced on our test systems. There
exists a bunch of ospf tests that we cannot guarantee that a
consistent DR will be elected for the test. As such let's not
even pretend that we care a bunch and just look for `Full`.
If we care about `ordering` we need to spend more time getting
the tests to actually start routers, ensure that htey are up and
running in the right order so that priority can take place.
Donald Sharp [Wed, 13 Oct 2021 16:40:35 +0000 (12:40 -0400)]
ospfd: Add `converged` and `role` json output for neighbor command
The `show ip ospf neighbor json` command was displaying
state:`Full\/DR`
Where state was both the role and whether or not the neigbhor
was converged. While from a OSPF perspective this is the state.
This state is a combination of two things.
This creates a problem in testing because we have no guarantee
that a particular ospf router will actually have a particular role
given how loaded our topotest systems are. So add a bit of json
output to display both the converged status as well as the
role this router is playing on this neighbor/interface.
The above becomes:
state:`Full\/DR`
converged:`Full`
role:`DR`
Tests can now be modified to look for `Full` and allow it to
continue. Most of the tests do not actually care if this
router is the DR or Backup.
Donald Sharp [Wed, 13 Oct 2021 13:03:27 +0000 (09:03 -0400)]
tests: Fix `Invalid escape sequence` warnings in test runs
Test runs are creating these warnings:
bgp_l3vpn_to_bgp_vrf/test_bgp_l3vpn_to_bgp_vrf.py::test_check_linux_mpls
<string>:7: DeprecationWarning: invalid escape sequence \d
Donald Sharp [Wed, 13 Oct 2021 11:58:37 +0000 (07:58 -0400)]
lib: Add missing enum values in switch statement for if_link_type_str
The switch statement over `enum zebra_link_type` had a default
and FRR was missing a few of the pre-defined types we cared about.
Remove the default statement and add the missing values.
Igor Ryzhov [Fri, 8 Oct 2021 21:22:31 +0000 (00:22 +0300)]
lib: set type for newly created interfaces
Currently, the ll_type is set only in `netlink_interface` which is
executed only during startup. If the interface is created when the FRR
is already running, the type is not stored.
Renato Westphal [Fri, 8 Oct 2021 00:06:01 +0000 (21:06 -0300)]
ospfd: fix display of plain-text data on "show ... json" commands
Add a 'json' parameter to the 'show_opaque_info' callback definition,
and update all instances of that callback to not display plain-text
data when the user requested JSON data.
Donald Sharp [Fri, 8 Oct 2021 11:37:15 +0000 (07:37 -0400)]
tests: Fix ospf[6]_gr_topo1 tests to work better under load
2 things:
a) Each test was setting up for graceful restart with calls to
`graceful-restart prepare ip[v6] ospf`, then sleeping for
3 or 5 seconds. Then killing the ospf process. Under heavy
load there is no guarantee that zebra has received/processed
this signal. Write some code to ensure that this happens
b) Tests are issuing commands in this order:
1) issue gr prepare command
2) kill router
3) <ensure routes were still installed in zebra>
4) start router
5) <ensure routes were stil installed in zebra>
Imagine that the system is under some load and there is
a small amount of time before step 5 happens. In this
case ospf could have come up and started neighbor relations
and also started installing routes. If zebra receives
a new route before step 5 is issued then the route could
be in a state where it is not installed, because it is
being sent to the kernel for installation. This would
fail the test because it would only look 1 time. This
is fixed by giving time on restart for the routes to
be in the installed state.
Renato Westphal [Fri, 8 Oct 2021 02:45:31 +0000 (23:45 -0300)]
ospfd: preserve DR status across graceful restarts
RFC 3623 says:
"If the restarting router determines that it was the Designated
Router on a given segment prior to the restart, it elects
itself as the Designated Router again. The restarting router
knows that it was the Designated Router if, while the
associated interface is in Waiting state, a Hello packet is
received from a neighbor listing the router as the Designated
Router".
Implement that logic when processing Hello messages to ensure DR
interfaces will preserve their DR status across a graceful restart.
Donald Sharp [Thu, 7 Oct 2021 16:08:42 +0000 (12:08 -0400)]
zebra: Display how long zebra is expected to wait for GR
When a client sends to zebra that GR mode is being turned
on. The client also passes down the time zebra should hold
onto the routes. Display this time with the output
of the `show zebra client` command as well.
Igor Ryzhov [Thu, 7 Oct 2021 12:53:10 +0000 (15:53 +0300)]
*: don't pass pointers to a local variables to thread_add_*
We should never pass pointers to local variables to thread_add_* family.
When an event is executed, the library writes into this pointer, which
means it writes into some random memory on a stack.
ospfd: ospf nbr in full although mismatch in hello packet contents
Issue:
===================
OSPF neighbors are not going down even after 10 mins when
having a mismatch in hello and dead interval.
First neighbors are formed and then a mismatch in the interval
is created, it is observed that the neighbor is not going down.
Root Cause Analysis:
====================
The event HelloReceived defined in RFC 2328 was named as PacketReceived
and this event was scheduled whenever LS Update, LS Ack, LS Request,
DD description packet or Hello packet is received.
Although there is a mismatch in the Hello packet contents, the
event PacketReceived gets triggered due to LS Update received and the
dead timer gets reset and hence the neighbor was never going Down and
remains FULL.
Fix:
==================
As per RFC 2328, the HelloReceived needs to be triggered only when
valid OSPF Hello packet is received and not when other OSPF packets
are received. Modified the function name as well.
Donald Sharp [Tue, 5 Oct 2021 00:32:25 +0000 (20:32 -0400)]
watchfrr: Allow an integrated config to work within a namespace
Since watchfrr invokes vtysh to gather the show run output and
write the data, if we are operating inside of a namespace FRR
must also pass this in.
Yes. This seems hacky. I don't fully understand why vtysh
is invoked this way.
New output:
sharpd@eva:~/frr3$ sudo vtysh -N one
Hello, this is FRRouting (version 8.1-dev).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
eva# wr mem
Note: this version of vtysh never writes vtysh.conf
% Can't open configuration file /etc/frr/one/vtysh.conf due to 'No such file or directory'.
Building Configuration...
Integrated configuration saved to /etc/frr/one/frr.conf
[OK]
eva#
David Lamparter [Wed, 6 Oct 2021 16:12:26 +0000 (18:12 +0200)]
doc/developer: document dev tag on master
We have `frr-X.Y-dev` tags on master after pulling stable branches,
otherwise the `gitversion` tooling / `--with-pkg-git-version` gets
_very_ confused (it'll print something like:
Igor Ryzhov [Wed, 6 Oct 2021 14:35:07 +0000 (17:35 +0300)]
lib: fix incorrect thread management
The current code passes an address of a local variable to `thread_add_read`
which stores it into `thread->ref` by the lib. The next time the thread
callback is executed, the lib stores NULL into the `thread->ref` which
means it writes into some random memory on the stack.
To fix this, we should pass a pointer to the vector entry to the lib.
Mark Stapp [Wed, 6 Oct 2021 14:51:04 +0000 (10:51 -0400)]
tests: remove deprecated debug cli from some tests
Some tests had commented-out references to the old "CLI()"
function. Remove those so they're not confusing in the future,
and replace at least one with a comment that uses the
'mininet_cli()' function.
Donald Sharp [Wed, 6 Oct 2021 11:58:35 +0000 (07:58 -0400)]
bgpd: Check return from generic_set_add
Coverity found a couple of spots where FRR was
ignoring the return code of generic_set_add.
Just follow the code pattern for the rest of
the usage in the code.
In startup, zebra would dump interface information from Kernel in 3
steps w/o lock: step1, get interface information; step2, get interface
ipv4 address; step3, get interface ipv6 address.
If any interface gets added after step1, but before step2/3, zebra
would get extra interface addresses in step2/3 that has not been added
into zebra in step1. Returning error in the referenced interface lookup
would cause the startup interface retrieval to be incomplete.
Rafael Zalamena [Tue, 5 Oct 2021 15:37:34 +0000 (12:37 -0300)]
topotests: justify code sleep
Document the `sleep` statement so people know that we are sleeping
because we are waiting for the BFD down notification. If we don't
sleep here it is possible that we get outdated `show` command results.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>