Sarita Patra [Fri, 21 Aug 2020 06:33:09 +0000 (23:33 -0700)]
bgpd: Fix BGP session stuck in OpenConfirm state
Issue:
1. Initially BGP start listening to socket.
2. Start timer expires and BGP tries to connect to peer and moved
to Idle->connect (lets say peer datastructre X)
3. Connect for X succeeds and hence moved from idle ->connect with
FD-x.
4. A incoming connection is accepted and a new peer datastructure Y
is created with FD-y moves from idle->Active state.
5. Peer datastercture Y FD-y sends out OPEN and moves to
Active->Opensent state.
6. Peer datastrcture Y FD-y receives OPEN and moved from Opensent->
Openconfirm state.
7. Meanwhile on peer datastrcture X FD-x sends out a OPEN message
and moved from connect->Opensent.
8. For peer datastrcture Y FD-y keep alive is received and it is
moved from OpenConfirm->Established.
9. In this case peer datastructure Y FD-y is a accepted connection
so we try to copy all its parameter to peer datastructure X and
delete Y.
10. During this process TCP connection for the accepted connection
(FD-y) goes down and hence get remote address and port fails.
11. With this failure bgp_stop function for both peer datastrure X
and peer datastructure Y is called.
12. By this time all the parameters include state for datastrcture
for X and Y are exchanged. Peer Y FD-y when it entered this
function had state OpenConfirm still which has been moved to peer
datastrcture X.
13. In bgp_stop it will stop all the timers and take action only if
peer is in established state. Now that peer datastrcture X and Y
are not in established state (in this function) it will simply
close all timers and close the socket and assigns socket for both
the peer datastrcture to -1.
14. Peer datastrcture Y will be deleted as it is a datastrcture created
due to accept of connection where as peer datastrcture X will be held
as it is created with configuration.
15. Now peer datastrcture X now holds a state of OpenConfirm without any
timers running.
16. With this any new incoming connection will never be able to establish
as there is config connection X which is stuck in OpenConfirm.
Fix:
While transferring the peer datastructure Y FD-y (accepted connection)
to the peer datastructure X, if TCP connection for FD-y goes down, then
1. Call fsm event bgp_stop for X (do cleanup with bgp_stop and move the
state to Idle) and
2. Call fsm event bgp_stop for Y (do cleanup with bgp_stop and gets deleted
since it is an accept connection).
Sarita Patra [Fri, 21 Aug 2020 06:29:08 +0000 (23:29 -0700)]
bgpd: Don't stop hold timer in OpenConfirm State
Issue:
1. Initially BGP start listening to socket.
2. Start timer expires and BGP tries to connect to peer and moved
to Idle->connect (lets say peer datastructre X)
3. Peer datastrcture Y FD-X receives OPEN and moved from Opensent->
Openconfirm state and start the hold timer.
4. In the OpenConfirm state, the hold timer is stopped. So peer X
waits for Keepalive message from peer. If the Keepalive message
is not received, then it will be in OpenConfirm state for
indefinite time.
5. Due to this it neither close the existing connection nor it will
accept any connection from peer.
Fix:
In the OpenConfirm state, don't stop the hold timer.
1. Upon receipt of a neighbor’s Keepalive, the state is moved to
Established.
2. But If the hold timer expires, a stop event occurs, the state
is moved to Idle.
This is as per RFC.
Renato Westphal [Wed, 19 Aug 2020 23:33:40 +0000 (20:33 -0300)]
lib: adapt plugin to use new Sysrepo version
Sysrepo recently underwent a complete rewrite, where some substantial
architectural changes were made (the most important one being the
extinction of the sysrepod daemon). While most of the existing API
was preserved, quite a few backward-incompatible changes [1] were
introduced (mostly simplifications). This commit adapts our sysrepo
northbound plugin to those API changes in order for it to be compatible
with the latest Sysrepo version.
Additional notes:
* The old Sysrepo version is EOL and not supported anymore.
* The new Sysrepo version requires libyang 1.x.
Renato Westphal [Wed, 19 Aug 2020 22:48:21 +0000 (19:48 -0300)]
staticd: fix warning when creating routes without SR-TE colors
The SR-TE color YANG leaf is optional so it shouldn't be created
unconditionally (it doesn't have a default value).
Fixes warnings like this when routes are created without specifying
a SR-TE color:
STATIC: libyang: Invalid value "" in "srte-color" element.
(/frr-routing:routing/control-plane-protocols/control-plane-protocol[type='frr-s
taticd:staticd'][name='staticd'][vrf='default']/frr-staticd:staticd/route-list[p
refix='99.0.0.1/32'][afi-safi='frr-routing:ipv4-unicast']/path-list[distance='1'
]/frr-nexthops/nexthop[nh-type='ip4'][vrf='default'][gateway='192.168.1.2'][inte
rface='(null)']/srte-color)
Donald Sharp [Wed, 19 Aug 2020 14:11:06 +0000 (10:11 -0400)]
zebra: Add table id to debug output
There are a bunch of places where the table id is not being outputed
in debug messages for routing changes. Add in the table id we
are operating on. This is especially useful for the case where
pbr is working.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Donald Sharp [Mon, 17 Aug 2020 17:52:19 +0000 (13:52 -0400)]
bgpd: Actually respect RFC 6286 for router_id
The RFC states:
The BGP Identifier is a 4-octet, unsigned, non-zero integer that
should be unique within an AS. The value of the BGP Identifier
for a BGP speaker is determined on startup and is the same for
every local interface and every BGP peer.
We were going slightly beyond this and ensuring that the address
was a specific range of addresses which is no longer relevant.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Renato Westphal [Fri, 14 Aug 2020 22:49:41 +0000 (19:49 -0300)]
lib: don't ignore error messages generated during the commit apply phase
While a configuration transaction can't be rejected once it reaches
the APPLY phase, we should allow NB callbacks to generate error
or warning messages when a configuration change is being applied.
That should be useful, for example, to return warnings back to
the user informing that the applied configuration has some kind of
inconsistency or is missing something in order to be effectively
activated. The infrastructure for this was already present, but the
northbound layer was ignoring all errors/warnings generated during
the apply/abort phases instead of returning them to the user. This
commit changes that.
In the gRPC plugin, extend the Commit() RPC adding a new
"error_message" field to the response type. This is necessary to
allow errors/warnings to be returned even when the commit operation
succeeds (since grpc::Status::OK doesn't support error messages
like the other status codes).
isisd : Transformational changes to support different VRFs.
1. Created a structure "isis master".
2. All the changes are related to handle ISIS with different vrf.
3. A new variable added in structure "isis" to store the vrf name.
4. The display commands for isis is changed to support different VRFs.
Donald Sharp [Thu, 13 Aug 2020 00:59:47 +0000 (20:59 -0400)]
tools: Remove zebra commands that have never existed
The support bundle feature(tm) asks for some data
from zebra in the form of a command that has
never existed in FRR. Looks like some
cruft snuck in remove.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Rafael Zalamena [Wed, 12 Aug 2020 11:26:45 +0000 (08:26 -0300)]
topotests: fix default BFD peer shutdown state
The commit `bfdd: simplify and remove duplicated code` fixed a problem
that was causing the protocol configuration to override the user
configuration.
In this test case: the peer was configured to be disabled (default is
`shutdown`) and the test was expecting it to get activated (`no shutdown`)
when the protocol converged. I changed the peer default state to
`no shutdown`, however another way to get the same effect is to
configure the protocol to use a profile or don't configure a peer at all
(and use the defaults).
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Donald Sharp [Wed, 12 Aug 2020 13:49:20 +0000 (09:49 -0400)]
lib: Properly handle POLLERR from poll()
There are situations where POLLERR will be returned. But
since we were not handling it. Thread processing effectively
is turned into an infinite loop, which is bad.
Modify the code so that if we receive a POLLERR we turn it
into a read event to be handled as an error from the handler
function.
Chirag Shah [Fri, 7 Aug 2020 20:01:58 +0000 (13:01 -0700)]
zebra: Revert "zebra: probe local inactive neigh"
Reverting probing of neigh entry. There is a timing where
probe and remote macip add request comes at the same time resulting
in neigh to remain in local state event though it should be remote.
In mobility case, the host moves to remote VTEP, first MAC only type-2
route is received which triggers a PROBE of neighs (associated to MAC).
PROBE request can go via network port to remote VTEP.
PROBE request picks up local neigh with MAC entry's outgoing port is
remote VTEP tunnel port.
The PROBE reply and MAC-IP (containing IP) almost comes same time at
DUT.
DUT first processes remote macip and installs neigh as remote.
Followed by receives neigh as REACHABLE which marks neigh as LOCAL.
FRR does have BPF filter which does not allow its own netlink request
to receive. Otherwise frr's request to program neigh as remote can move
neigh from local to remote.
Though ordering can not be guranteed that REACHABLE (PROBE's repsonse)
can come at anytime and move it to LOCAL.
This fix would not suffice the needs of converging LOCAL inactive neighs
to remove from DB. As mobility draft sugges to PROBE local neigh when
MAC moves to remote but it is not working with current framework.
Warning logs -
Logic error: Dereference of null pointer in zebra_evpn_mh.c, function zebra_evpn_es_evi_show_vni, line 360
See https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13544/artifact/shared/static_analysis/report-b1eb72.html#EndPath
Pat Ruddy [Thu, 23 Jul 2020 21:51:10 +0000 (14:51 -0700)]
zebra: rename vni to evpn where appropriate
The main zebra_vni_t hash structure has been renamed to zebra_evpn_t
to allow for other transport underlays. Rename functions and variables
to reflect this change.
Sebastien Merle [Fri, 14 Feb 2020 14:49:25 +0000 (14:49 +0000)]
staticd: add support for SR Policies
Configuration example:
ip route 9.9.9.9/32 6.6.6.6 color 123
The SR Policy to be chosen is uniquely identified by the policy
endpoint (6.6.6.6) and the SR-TE color (123). Traffic will be
augmented with an MPLS label stack according to the active
candidate path of that particular policy.
Renato Westphal [Tue, 11 Aug 2020 23:58:00 +0000 (20:58 -0300)]
grpc: relicense the northbound service description file
The GPL license is too restrictive and can "contaminate" client scripts
using language bindings auto-generated from this file. We don't want
this to happen. Adopt the BSD-2-Clause license to avoid any problem
of this nature.
Quentin Young [Tue, 11 Aug 2020 18:24:56 +0000 (14:24 -0400)]
vrrpd: log errmsg, stricter nb validation
* When failing a config transaction due to a VRID conflict, describe the
error in the provided space
* When validating, allow the NB userdata lookup for interface object to
soft fail; but when applying, assert if it does not exist
Rafael Zalamena [Thu, 6 Aug 2020 19:25:44 +0000 (16:25 -0300)]
bfdd: implement passive mode
The passive mode is briefly described in the RFC 5880 Bidirectional
Forwarding Detection (BFD), Section 6.1. Overview:
> A system may take either an Active role or a Passive role in session
> initialization. A system taking the Active role MUST send BFD
> Control packets for a particular session, regardless of whether it
> has received any BFD packets for that session. A system taking the
> Passive role MUST NOT begin sending BFD packets for a particular
> session until it has received a BFD packet for that session, and thus
> has learned the remote system's discriminator value. At least one
> system MUST take the Active role (possibly both). The role that a
> system takes is specific to the application of BFD, and is outside
> the scope of this specification.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Quentin Young [Thu, 30 Jul 2020 20:44:46 +0000 (16:44 -0400)]
vrrpd: fix improper NB query during validation
We were querying the NB for an interface and expecting it to exist, but
we were doing this during a validation run when the interface hasn't yet
been created, resulting in an abort. Adjust validation checks to handle
this scenario.
Quentin Young [Tue, 2 Jun 2020 19:33:05 +0000 (15:33 -0400)]
vrrpd: don't allow autocreated vr's in NB layer
Changing properties on an autoconfigured VRRP instance results in its
pointer being stored as a userdata in the NB tree, leading to UAF when
autoconfigure deletes the instance and then later NB operations take
place using the now-stale pointer.
Ticket: CM-29850 Signed-off-by: Quentin Young <qlyoung@cumulusnetworks.com>