Donald Sharp [Mon, 16 Sep 2019 18:25:55 +0000 (14:25 -0400)]
watchfrr: Convert `wtf` to a more meaningful message
There is a fairly common state we are seeing where watchfrr
has decided that something is not right and is printing out
a `wtf` message. At this point I am not sure what is going on
or how we are getting here, but let's add a bit more data dump
to the message so that we can figure out what is going on.
This is mainly being done because at this point in time I have no
clue the what/how of how we got here and I cannot reproduce.
Maybe by adding more useful information here I can figure out what is
going on.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com.
Donald Sharp [Mon, 16 Sep 2019 17:47:50 +0000 (13:47 -0400)]
watchfrr: Allow end users to turn off watchfrr for a particular daemon
Allow an end user who is debugging behavior, with say gdb, to turn
off watchfrr and it's attempts to keep control of a daemons up/responsiveness
With code change:
donna.cumulusnetworks.com# show watchfrr
watchfrr global phase: Idle
zebra Up
bgpd Up/Ignoring Timeout
staticd Up
Now grab bgpd with gdb:
sharpd@donna ~/frr4> date ; sudo gdb -p 27893
Mon 16 Sep 2019 01:44:57 PM EDT
GNU gdb (GDB) Fedora 8.3-6.fc30
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 27893
[New LWP 27894]
[New LWP 27895]
[New LWP 27896]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f1787a3e5c7 in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29-15.fc30.x86_64 gperftools-libs-2.7-5.fc30.x86_64 json-c-0.13.1-4.fc30.x86_64 libcap-2.26-5.fc30.x86_64 libgcc-9.1.1-1.fc30.x86_64 libgcrypt-1.8.4-3.fc30.x86_64 libgpg-error-1.33-2.fc30.x86_64 libstdc++-9.1.1-1.fc30.x86_64 libxcrypt-4.4.6-2.fc30.x86_64 libyang-0.16.105-1.fc30.x86_64 lua-libs-5.3.5-5.fc30.x86_64 lz4-libs-1.8.3-2.fc30.x86_64 pcre-8.43-2.fc30.x86_64 xz-libs-5.2.4-5.fc30.x86_64
(gdb)
In another window we can see when watchfrr thinks it's not responding:
donna.cumulusnetworks.com# show watchfrr
watchfrr global phase: Idle
zebra Up
bgpd Unresponsive/Ignoring Timeout
staticd Up
Finally exit gdb and watchfrr now believes bgpd is good to go again:
donna.cumulusnetworks.com# show watchfrr
watchfrr global phase: Idle
zebra Up
bgpd Up/Ignoring Timeout
staticd Up
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Quentin Young [Mon, 16 Sep 2019 15:33:49 +0000 (15:33 +0000)]
bgpd: do not send keepalives when KA timer is 0
RFC4271 specifies behavior when the hold timer is sent to zero - we
should not send keepalives or run a hold timer. But FRR, and other
vendors, allow the keepalive timer to be set to zero with a nonzero hold
timer. In this case we were sending keepalives constantly and maxing out
a pthread to do so. Instead behave similarly to other vendors and do not
send keepalives.
Unsure what the utility of this is, but blasting keepalives is
definitely the wrong thing to do.
Signed-off-by: Quentin Young <qlyoung@cumulusnetworks.com>
vdhingra [Tue, 27 Aug 2019 10:45:54 +0000 (03:45 -0700)]
lib: rmap dep table is not correct in case of exact-match clause
User pass the string match large-community 1 exact-match from CLI.
Now route map lib has got the string as "1 exact-match". It passes the string
to call back for compilation. BGP will parse this string and came to know
that for "1" it has to do exact match. Routemap lib has to save "1" in it’s
dependency table. Here routemap is saving this as a “1 exact-match”
which is wrong. The solution is used the compiled data.
Donald Sharp [Fri, 13 Sep 2019 20:43:16 +0000 (16:43 -0400)]
bgpd: Create `set distance XXX` command for routemaps
Allow bgp to set a local Administrative distance to use
for installing routes into the rib.
Example:
!
router bgp 9323
bgp router-id 1.2.3.4
neighbor enp0s8 interface remote-as external
!
address-family ipv4 unicast
neighbor enp0s8 route-map DISTANCE in
exit-address-family
!
route-map DISTANCE permit 10
set distance 153
!
line vty
!
end
eva# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route
B 0.0.0.0/0 [153/0] via fe80::a00:27ff:fe84:c2d6, enp0s8, 00:00:06
K>* 0.0.0.0/0 [0/100] via 10.0.2.2, enp0s3, 00:06:31
B>* 1.1.1.1/32 [153/0] via fe80::a00:27ff:fe84:c2d6, enp0s8, 00:00:06
B>* 1.1.1.2/32 [153/0] via fe80::a00:27ff:fe84:c2d6, enp0s8, 00:00:06
B>* 1.1.1.3/32 [153/0] via fe80::a00:27ff:fe84:c2d6, enp0s8, 00:00:06
C>* 10.0.2.0/24 is directly connected, enp0s3, 00:06:31
K>* 169.254.0.0/16 [0/1000] is directly connected, enp0s3, 00:06:31
eva#
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Stephen Worley [Wed, 11 Sep 2019 18:22:55 +0000 (14:22 -0400)]
pbrd: Handle GATEWAY_IFINDEX nht conflicts
In pbrd we did not care if the nexthop interface nexthop tracking
sent us back did not match the one specified with `nexthop [GATEWAY]
[INTERFACE]`. This happened if the gateway was resolvable via a
different interface and the inteface we specified in the config was
unnumbered (no ipv4 address on it) since the default route gets forced
onlink when it gets into zebra.
This patch adds a check to not install the rule if the interface we got
back was different from the specified.
This patch also reworks the nexthop update path to make it a little more
clear what its doing.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
bgpd: prefer-global command not working on IPv4 peers
`set ipv6 next-hop prefer-global` is not working on IPv4 peers.
In MP-BGP, bgp routers can advertising IPv6 routes over IPv4 peers.
Remove the peer's remote address AFI type checking.
Donald Sharp [Tue, 10 Sep 2019 23:48:21 +0000 (19:48 -0400)]
tests: Add admin distance 255 static routes
Add a couple of test cases to ensure that admin distance of
255 actually causes the route to be accepted by zebra but
not installed into the linux kernel.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Donald Sharp [Wed, 11 Sep 2019 03:16:01 +0000 (23:16 -0400)]
tests: Fix topotests due to json error
Recent commit: 5fba22485b added a new topotest that used
an older version of FRR that referenced some json code
that was changed in between when the PR was submitted
and when it got in.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Quentin Young [Mon, 9 Sep 2019 16:59:09 +0000 (16:59 +0000)]
yang: create interface reference type
Instead of copy-pasting a 16 character string type for use as an
interface reference, create a new typedef that leafref's the name node
of an interface. This way the constraints change with the constraints on
an interface name itself, and it's self documenting.
Incidentally ripd and ripngd forgot the 16 character constraint in their
offset-list configs and IS-IS forgot it entirely, so this also fixes
minor bugs.
Signed-off-by: Quentin Young <qlyoung@cumulusnetworks.com>
The Pim RFC does not appear to state any length requirements
of pim, other than the checksum must be correct.
Certain vendors are sending extra data at the end of a pim assert
message. This while not explicitly against the rules was a bit
of surprise to pim when we threw the assert message on the floor
for being too long.
Modify the test to see if length left will allow us to read
the 8 bytes of data that we need. If it is sufficient for
that allow the packet to be used.
Fixes: #4957 Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
There's no need to install MPLS FTNs using the ZEBRA_ROUTE_ADD
message since the ZEBRA_MPLS_LABELS_ADD message already does that
(in addition to installing an MPLS LSP).
rgirada [Fri, 10 May 2019 05:35:48 +0000 (22:35 -0700)]
staticd: static route config should fail if gw configured as its local ip.
Fix:
Added a check in staticd upon receiving nexthop update from zebra such that
it will fail to resolve the nexthop if the connected address added as nexthop.
But still allowing to add to staticd database and appears in running config.
Throwing an warning massage to user if such misconfig issued.
Renato Westphal [Thu, 8 Aug 2019 20:08:36 +0000 (17:08 -0300)]
zebra: improve cleanup of MPLS labels when zclient disconnects
Use the zserv_client_close hook to cleanup all MPLS labels advertised
by a zclient when it disconnects. We were doing this cleanup for
ldpd only, but now we have other daemons that are MPLS aware,
like ospfd (due to the new Segment Routing feature).
Renato Westphal [Thu, 8 Aug 2019 16:56:39 +0000 (13:56 -0300)]
zebra: identify MPLS FTNs by route type and instance
Use the route type and instance instead of the route distance
to identify MPLS FTNs. This is a more robust approach since the
routing daemons can modify the distance of their announced routes
via configuration, which can cause inconsistencies.
Renato Westphal [Thu, 8 Aug 2019 00:06:03 +0000 (21:06 -0300)]
lib: introduce encode/decode functions for the MPLS zapi messages
Do this for the following reasons:
* Improve modularity of the code by separating the decoding of the
ZAPI messages from their processing;
* Create an API that is easier to use by the client daemons.
Renato Westphal [Wed, 28 Aug 2019 15:14:10 +0000 (12:14 -0300)]
tests: remove topotest compatibility with older ldpd versions
Now that topotest was integrated into the FRR repository, we
don't need to worry anymore about creating tests that work across
different FRR versions. The topotests present on any branch need
to be compatible only with the FRR daemons from that same branch.
Chirag Shah [Tue, 27 Aug 2019 21:41:00 +0000 (14:41 -0700)]
bgpd: clear l3vni prefix-only flag upon deletion
When L3vni is created with prefix-only flag,
the flag is set at bgp vrf instance level.
In the case of bgp instance is non auto created,
means user configured instance (i.e 'router bgp x vrf <name>')
Upon deletion of l3vni, clear the prefix-only flag from
bgp vrf instance.
Donald Sharp [Fri, 6 Sep 2019 12:46:27 +0000 (08:46 -0400)]
tests: Ensure we wait 1 bgp timeout period before declaring failure
The lib/bgp.py test code is bringing up neighbors and clearing them
to test that things are working appropriately. The problem we have
is that we are only waiting 30 seconds for declaration of failure.
In a high load system packets can be lost and as such the initial
convergence may not happen. Modify the test to wait for 1 retry
window test period before declaring failure.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Commit eaf6705d7a fixed a problem caused by configuration changes
coming from the kernel. The fix consisted of regenerating the
candidate configuration before every configuration command (when
using the non-transactional CLI mode). There's no need, however,
to regenerate the candidate when it's identical to the running
configuration. Since the northbound keeps track of the version
of each configuration, we can use that information to prevent
regenerating the candidate configuration when that is not necessary.
Mark Stapp [Thu, 5 Sep 2019 16:58:58 +0000 (12:58 -0400)]
zebra: avoid using zebra datastructs in evpn dataplane path
Some netlink-facing code used for evpn/vxlan programming was
being run in the dataplane pthread, but accessing zebra core
datastructs. Move some additional data into the dataplane
context, and use it in the netlink path instead.
Donald Sharp [Thu, 5 Sep 2019 16:30:26 +0000 (12:30 -0400)]
ospfd: Remove flog_warn for a situation user can never do anything with
When OSPF receives a Database description packet and is in
`Down`, `Attempt` or `2-Way` state we are creating a warning
for the end user.
rfc2328 states(10.6):
Down - The packet should be rejected
Attempt - The packet should be rejected
2-Way - The packet should be ignored
I cannot find any instructions in the rfc to state what the operational
difference is between rejected and ignored. Neither can I figure
out what FRR expects the end user to do with this information.
I can see this information being useful if we encounter a bug
down the line and we have gathered a bunch of data. As such
let's modify the code to remove the flog_warn and convert
the message to a debug level message that can be controlled by
appropriate debug statements.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Stephen Worley [Wed, 4 Sep 2019 16:38:56 +0000 (12:38 -0400)]
staticd: Re-send/Remove routes on interface events
We were not processing interface up/down events for device only
static routes. This patch looks up the ifp and then calls
the same API we are using for interface add/remove events.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>