[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[RRG] BGP path hunting, MRAI timer and Path Length Damping
I am finalising some technical writing explaining path hunting,
based on the understanding I initially gained from Geoff's
article - but looking more closely, my understanding of what
would happen doesn't match Geoff's diagram.
I don't see how Geoff's example, with my understanding of BGP
and its MRAI timer, would lead to something resembling Tony's
explanation either.
I am keen to establish whether Geoff's explanation is realistic
or not. My understanding of Sriram's message is that he thinks
it is not.
Tony, can you advise on this? Anyone else?
The second question is whether the MRAI timer applies to
withdrawals. I think it does (as I explain below), but Sriram
thinks not.
Geoff's diagram :
http://www.potaroo.net/ispcol/2007-06/dampbgp.html
explains the pattern of behavior which results in a series of
longer best-path messages emanating out from the local part of
the network, followed by a withdrawal.
I am not sure about the time elapsed between the four sections
of Fig 1, which I will call A, B, C and D.
Geoff doesn't explain it in text, and Tony gives this explanation:
http://tools.ietf.org/html/draft-li-bgp-stability-01#section-1.3.1
. . . when a prefix is withdrawn from the network, as the
local BGP speaker receives the prefix withdrawal from a BGP
peer it may substitute a previously received announcement
for this prefix from a different peer in its place and
announce this replacement route to its peers in response to
the received withdrawal. If this replacement path is
subsequently withdrawn, the local BGP speaker will again
select another previously received announcement from a
different peer, if one exists. This process may continue
until the original prefix withdrawal has propagated across
the entire routing space.
Here is my understanding, based on Geoff's diagram and Tony's
written explanation. Is there a better explanation somewhere?
A This is the steady state before T = 0.0, when (2) detects
that its link to (1) goes down.
B, sort of . . .
T = 0.1 secs
(2) sends withdrawals to (5) and (3).
This is almost immediately after the link goes down - say
a 0.1 second processing time after (2) determines that (1)
is unreachable. This is the time it takes to generate two
withdrawal announcements.
(3) gets the withdrawal too. This is where my understanding
departs from the diagram. In B, (5) has sent an
announcement but 3 hasn't. I figure they would typically
send their announcements at about the same time.
In my understanding, at 0.1 seconds, both (5) and (3) have
been sent the withdrawal from (2) but have not yet processed
them.
T = 0.2 secs
This is a 0.1 second processing delay after (5) gets the
withdrawal from (2), it decides that its current best path
"5, 2, 1" is invalid and so adopts the next best path: "5,
3, 2, 1".
It announces this new, longer, best path (AA+) to its peers,
and we are only interested in the announcement which goes
out to other peer routers (upstream peers) to the top right
of the diagram. That AA+ triggers a cascade of such
announcements through many other routers, perhaps across the
entire Net, if there is no better alternative - which there
isn't, according to this diagram.
(5) sets its FIB to send the packets to (3) rather than (2).
[ Sidebar on Root Cause Notification
[
[ My understanding of RCN (Root Cause Notification) is
[ that with RCN, (5) would be able to tell that this new
[ longer path is not going to work either, and neither
[ would the other path(s) it knows about. So with RCN
[ (5) would now announce a withdrawal to its upstream
[ peers, which is optimal behavior.
[
[ If these numbers identified routers, even without
[ RCN, (5) could figure out that the "5, 3, 2, 1" path
[ wouldn't work, because (2) has told (5) it can't
[ reach (1). However to minimise BGP messages and RIB
[ state, these numbers identify Autonomous Systems, and
[ there's no definite reason for (5) to assume that (3)
[ can't reach (1) via some other router in AS (2) -
[ although it does have to assume that AS 2 has internal
[ routing problems if (3) can reach (1) via AS (2) but
[ that 5's peer router in AS (2) can't.
(5) sets its MRAI timer, so it will expire at T = 30.2
seconds.
(3) responds to the withdrawal it got from (2) by announcing
a withdrawal to (5) and (4). It doesn't use (4)'s path to
(1) because that path flows through (3)'s own Autonomous
System.
(3)'s MRAI timer is also set to time out at T = 30.2 seconds
but that does not concern us.
T = 0.3 secs
(5) receives from (3) the withdrawal of the path to (1).
At this point in time, (5) chooses the previously announced
path via (4): "4, 3, 2, 1", and applies it to its FIB.
However, (5) will not announce it because its MRAI timer has
been set and will expire at T = 30.2 seconds.
(4) sends a withdrawal announcement to (5).
T = 0.4 secs
(5) processes the withdrawal it received from (4).
Since there are no further paths available to (1) from (5),
(5) changes its RIB state for this prefix to reflect the
fact it has no such path - and it configures its FIB to drop
packets addresses to the prefix.
However, (5) does not announce anything because its MRAI
timer is still running.
T = 30.2 secs
(5)'s MRAI timer expires, so it sends the withdrawal to its
upstream peers.
So in my understanding, there is a 30.2 second period of packets
being dropped, while for 30.1 seconds of this time (5) still
says it has a path "5, 3, 2, 1" for this prefix. This may be
preferred by many other routers, even though they actually have
a longer, path (not shown) to (1). So this may cause lost
connectivity to (1) for 30 seconds even though it has another,
much longer, link to other routers somewhere.
In my understanding, there is no strung out sequence of longer
and longer paths being announced by 5, as Geoff's diagram
depicts. Just the first announcement at T = 0.1s of a longer
path, followed by a withdrawal at T = 30.2 seconds. This does
not really resemble the "path hunting" pattern Geoff and Tony
refer to.
My understanding of what Geoff writes is that "path hunting"
occurs anyway, and is exacerbated in variations in the time
constant of the Minimum Route Advertisement Interval (MRAI)
Timer. I am not sure if these are small variations of fractions
of a second from the default 30 seconds, or some other default
or configured values such as 20 to 50 seconds. I don't see how
various values of MRAI in different routers would affect my
explanation of what happens in the situation depicted by Geoff's
diagram, other than to change the delay time to 20.2 or 50.2
seconds, for instance.
In an earlier message ("Geoff Huston's article ...") I wrote
about my understanding of Tony's Li's proposal for Path
Length Damping (PLD):
http://tools.ietf.org/html/draft-li-bgp-stability-01#section-3.4
which involves a new algorithm looking for updates giving a
longer path than previously, (AA+ in Geoff's article) and
suppressing announcements regarding this prefix which also
extend the length of the best path, for a period of time.
I guess this would need to replace the MRAI timer's function, or
operate with longer time constants than that timer, but I don't
think Tony's I-D mentions this.
Tony's I-D
http://tools.ietf.org/html/draft-li-bgp-stability-01#section-3.4
cites Geoff's article and includes:
We therefore suggest that this event could be used to
trigger a suppression period, which would allow the
prefix to oscillate arbitrarily without propagation to
the remainder of the network.
When a BGP speaker found that it needed to change its best
path for a prefix and that the new best path was longer than
the previous best path, then it would suppress the
advertisement of the longer AS path to its neighbor and start
a timer. Subsequent changes to the prefix that continue to
lengthen the AS path would restart the timer. If the BGP
speaker installed a shorter best path, or undertook a local
withdrawal it would remove the suppressed update and cancel
the timer. Otherwise, when the timer expired, the BGP
speaker would advertise the result, if any.
This sounds like a great idea. I understand that when applied
to Geoff's diagram, here is what would happen, assuming for the
moment that there was no MRAI timer:
T = 0.1 seconds
(2) sends withdrawals to (5) and (3).
T = 0.2 seconds.
(5) gets (2)'s withdrawal and does not announce anything,
since its best path "3, 2, 1" is now longer than before,
and Tony's (Geoff's?) PLD algorithm is invoked.
(5)'s PLD timer is set to count down, lets say with a
40 second time constant. Absolute time of expiry will be
40.2 secs.
(5) changes its FIB to direct the packets to the next
best path it knows, via (3).
(3) gets (2)'s withdrawal too. Since (3) has no other
path, it announces a withdrawal to (5) and (4).
T = 0.3 seconds
(5) gets a withdrawal from (3) and does not announce
anything, since its best path is now longer than before:
"4, 3, 2, 1".
(5)'s PLD timer is reset and made to count down, lets say
from 40 seconds. Absolute time of expiry will be 40.3 secs.
(5) changes its FIB to direct the packets to the next
best path it knows, via (4).
(4) gets (3)'s withdrawal too - and sends a withdrawal to
(5).
T = 0.4 seconds
(5) gets a withdrawal from (4).
Because this eliminates the last path (5) had to (1):
"4, 3, 2, 1", (5) needs to announce a withdrawal.
It does so immediately, and disables the PLD timer.
(5) configures its FIB to drop packets addressed to this
prefix.
So with this PLD algorithm, as I understand it, there is only
one message sent by (5), and it is the correct one, sent at 0.4
seconds, not at 30.2 seconds as with current arrangements.
With PLD, (5) exhibits to its peers no false optimism about
delivering packets to (1), other than waiting for a very short
time (ensured by the PLD algorithm in the other routers, and by
the topology) to announce the withdrawal as soon as it was
certain it had was no path to (1).
Geoff wrote to the list:
> The withdrawal would propagate, or the path would stabilise.
> One way to view it is as a more selective form of Min Route
> Advert Interval Timer.
Responding this and previous discussion of my message, Sriram wrote:
> My understanding is that MRAI does not apply to
> explicit withdrawals (the AWs and WWs in your taxonomy).
However, my interpretation of the RFC:
http://tools.ietf.org/html/rfc4271#section-9.2.1.1
is that the MinRouteAdvertisementIntervalTimer (MRAI Timer),
which defaults to 30 seconds (Section 10), applies to all
announcements, with no exemption for withdrawals.
Two UPDATE messages sent by a BGP speaker to a peer that
advertise feasible routes and/or withdrawal of unfeasible
routes to some common set of destinations MUST be separated
by at least MinRouteAdvertisementIntervalTimer.
So I think Sriram's understanding is incorrect.
> Did you happen to verify this based on the observed data?
> The reason I am asking is if this were implemented,
> then in your illustration of Fig. 1 (path hunting),
> all the withdraws (W) will make it quickly to Router 5.
I am not sure whether this last line refers to "this" being the
implementation of PLD or the way he thinks BGP routers already
behave, with the MRAI timer not stopping withdrawal messages.
However, the behavior Sriram goes on to describe accords with my
own interpretation, as described above, of how the routers would
behave with current BGP.
> BGP Router 5 may have sent one AA+ in response to the very first withdraw
> from Router 2 but any subsequent AA+ it tries to send will be
> held back for MRAI interval and during that time the
> other withdraws from Routers 3 and 4 will make it to Router 5
> (because they are not subjected to MRAI),
So far, I agree with Sriram's interpretation. I understand his
"not subjected to MRAI" as meaning that the withdrawal cascade
from (2) to (3) to (4) propagates rapidly (0.1 sec per step in
my explanation) because this is the first change regarding this
prefix in quite a long time.
However, I think Sriram believes a second, or perhaps the only
reason, the cascade is so quick is because he believes that the
MRAI timer does not apply to withdrawals.
> and hence Router 5
> will send only one AA+ and then a withdraw.
If the MRAI timer did not stop withdrawals, then the outcome
would be the same as for my understanding of what would happen
with Tony's PLD algorithm: (5) would propagate the withdrawal at
T = 0.4 seconds.
> If "not applying MRAI to explicit withdrawals" is implemented,
> then withdrawals propagate much faster than AA+ or AA0 updates,
> hence path hunting related updates would be significantly reduced.
> I would think this is how MRAI was intended to expedite
> BGP convergence and reduce churn.
This may have been the intention, but to my understanding, the
MRAI timer delays withdrawals so they are announced at least 30
seconds after all previous announcements. The highly meshed
topology of Geoff's diagram, which is typical of the Internet,
ensures that (5) first announces an AA+ and then waits 30
seconds before announcing the withdrawal.
Here is the summary of my current understanding. Please correct
it!
Geoff's diagrams don't give enough timing detail for me to
critique properly, but I think that with current BGP behavior,
the outcome for the rest of the BGP routers is really bad:
T = 0.2 sec AA+ longer path announcement
T = 30.2 sec Withdrawal.
Sriram seems to believe MRAI was developed with an intent to
achieve this, and thinks the RFC specifies that it should not
stop withdrawals. But my understanding of the RFC is that MRAI
does delay withdrawals, in any circumstance like (5)'s, where it
has a next-best path which it announces first, because it hasn't
yet received a withdrawal from (3) or (4).
Tony, you co-wrote the RFC - what do you think?
- Robin http://www.firstpr.com.au/ip/ivip/
--
to unsubscribe send a message to rrg-request@psg.com with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg