[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[RRG] BGP path hunting, MRAI timer and Path Length Damping



I am finalising some technical writing explaining path hunting,
based on the understanding I initially gained from Geoff's
article - but looking more closely, my understanding of what
would happen doesn't match Geoff's diagram.

I don't see how Geoff's example, with my understanding of BGP
and its MRAI timer, would lead to something resembling Tony's
explanation either.

I am keen to establish whether Geoff's explanation is realistic
or not.  My understanding of Sriram's message is that he thinks
it is not.

Tony, can you advise on this?  Anyone else?

The second question is whether the MRAI timer applies to
withdrawals.  I think it does (as I explain below), but Sriram
thinks not.


Geoff's diagram :

  http://www.potaroo.net/ispcol/2007-06/dampbgp.html
	
explains the pattern of behavior which results in a series of
longer best-path messages emanating out from the local part of
the network, followed by a withdrawal.

I am not sure about the time elapsed between the four sections
of Fig 1, which I will call A, B, C and D.

Geoff doesn't explain it in text, and Tony gives this explanation:


http://tools.ietf.org/html/draft-li-bgp-stability-01#section-1.3.1

  . . . when a prefix is withdrawn from the network, as the
  local BGP speaker receives the prefix withdrawal from a BGP
  peer it may substitute a previously received announcement
  for this prefix from a different peer in its place and
  announce this replacement route to its peers in response to
  the received withdrawal.  If this replacement path is
  subsequently withdrawn, the local BGP speaker will again
  select another previously received announcement from a
  different peer, if one exists.  This process may continue
  until the original prefix withdrawal has propagated across
  the entire routing space.

Here is my understanding, based on Geoff's diagram and Tony's
written explanation.  Is there a better explanation somewhere?

A   This is the steady state before T = 0.0, when (2) detects
    that its link to (1) goes down.

B, sort of . . .

T =  0.1 secs

   (2) sends withdrawals to (5) and (3).

   This is almost immediately after the link goes down - say
   a 0.1 second processing time after (2) determines that (1)
   is unreachable.  This is the time it takes to generate two
   withdrawal announcements.

   (3) gets the withdrawal too.  This is where my understanding
   departs from the diagram.  In B, (5) has sent an
   announcement but 3 hasn't.  I figure they would typically
   send their announcements at about the same time.

   In my understanding, at 0.1 seconds, both (5) and (3) have
   been sent the withdrawal from (2) but have not yet processed
   them.


T =  0.2 secs

    This is a 0.1 second processing delay after (5) gets the
    withdrawal  from (2), it decides that its current best path
    "5, 2, 1" is invalid and so adopts the next best path: "5,
    3, 2, 1".

    It announces this new, longer, best path (AA+) to its peers,
    and we are only interested in the announcement which goes
    out to other peer routers (upstream peers) to the top right
    of the diagram.  That AA+ triggers a cascade of such
    announcements through many other routers, perhaps across the
    entire Net, if there is no better alternative - which there
    isn't, according to this diagram.

    (5) sets its FIB to send the packets to (3) rather than (2).

      [ Sidebar on Root Cause Notification
      [
      [  My understanding of RCN (Root Cause Notification) is
      [  that with RCN, (5) would be able to tell that this new
      [  longer path is not going to work either, and neither
      [  would the other path(s) it knows about.  So with RCN
      [  (5) would now announce a withdrawal to its upstream
      [  peers, which is optimal behavior.
      [
      [  If these numbers identified routers, even without
      [  RCN, (5) could figure out that the "5, 3, 2, 1" path
      [  wouldn't work, because (2) has told (5) it can't
      [  reach (1).  However to minimise BGP messages and RIB
      [  state, these numbers identify Autonomous Systems, and
      [  there's no definite reason for (5) to assume that (3)
      [  can't reach (1) via some other router in AS (2) -
      [  although it does have to assume that AS 2 has internal
      [  routing problems if (3) can reach (1) via AS (2) but
      [  that 5's peer router in AS (2) can't.

   (5) sets its MRAI timer, so it will expire at T = 30.2
   seconds.

   (3) responds to the withdrawal it got from (2) by announcing
   a withdrawal to (5) and (4).  It doesn't use (4)'s path to
   (1) because that path flows through (3)'s own Autonomous
   System.

   (3)'s MRAI timer is also set to time out at T = 30.2 seconds
   but that does not concern us.


T =  0.3 secs

   (5) receives from (3) the withdrawal of the path to (1).
   At this point in time, (5) chooses the previously announced
   path via (4): "4, 3, 2, 1", and applies it to its FIB.
   However, (5) will not announce it because its MRAI timer has
   been set and will expire at T = 30.2 seconds.

   (4) sends a withdrawal announcement to (5).


T = 0.4 secs

   (5) processes the withdrawal it received from (4).

   Since there are no further paths available to (1) from (5),
   (5) changes its RIB state for this prefix to reflect the
   fact it has no such path - and it configures its FIB to drop
   packets addresses to the prefix.

   However, (5) does not announce anything because its MRAI
   timer is still running.


T = 30.2 secs

   (5)'s MRAI timer expires, so it sends the withdrawal to its
   upstream peers.


So in my understanding, there is a 30.2 second period of packets
being dropped, while for 30.1 seconds of this time (5) still
says it has a path "5, 3, 2, 1" for this prefix.  This may be
preferred by many other routers, even though they actually have
a longer, path (not shown) to (1).  So this may cause lost
connectivity to (1) for 30 seconds even though it has another,
much longer, link to other routers somewhere.

In my understanding, there is no strung out sequence of longer
and longer paths being announced by 5, as Geoff's diagram
depicts.  Just the first announcement at T = 0.1s of a longer
path, followed by a withdrawal at T = 30.2 seconds.  This does
not really resemble the "path hunting" pattern Geoff and Tony
refer to.


My understanding of what Geoff writes is that "path hunting"
occurs anyway, and is exacerbated in variations in the time
constant of the Minimum Route Advertisement Interval (MRAI)
Timer.  I am not sure if these are small variations of fractions
of a second from the default 30 seconds, or some other default
or configured values such as 20 to 50 seconds.  I don't see how
various values of MRAI in different routers would affect my
explanation of what happens in the situation depicted by Geoff's
diagram, other than to change the delay time to 20.2 or 50.2
seconds, for instance.

In an earlier message ("Geoff Huston's article ...") I wrote
about my understanding of Tony's Li's proposal for Path
Length Damping (PLD):

  http://tools.ietf.org/html/draft-li-bgp-stability-01#section-3.4

which involves a new algorithm looking for updates giving a
longer path than previously, (AA+ in Geoff's article) and
suppressing announcements regarding this prefix which also
extend the length of the best path, for a period of time.

I guess this would need to replace the MRAI timer's function, or
operate with longer time constants than that timer, but I don't
think Tony's I-D mentions this.

Tony's I-D

  http://tools.ietf.org/html/draft-li-bgp-stability-01#section-3.4

cites Geoff's article and includes:

  We therefore suggest that this event could be used to
  trigger a suppression period, which would allow the
  prefix to oscillate arbitrarily without propagation to
  the remainder of the network.

  When a BGP speaker found that it needed to change its best
  path for a prefix and that the new best path was longer than
  the previous best path, then it would suppress the
  advertisement of the longer AS path to its neighbor and start
  a timer.  Subsequent changes to the prefix that continue to
  lengthen the AS path would restart the timer.  If the BGP
  speaker installed a shorter best path, or undertook a local
  withdrawal it would remove the suppressed update and cancel
  the timer.  Otherwise, when the timer expired, the BGP
  speaker would advertise the result, if any.

This sounds like a great idea.  I understand that when applied
to Geoff's diagram, here is what would happen, assuming for the
moment that there was no MRAI timer:

T = 0.1 seconds

   (2) sends withdrawals to (5) and (3).


T = 0.2 seconds.

   (5) gets (2)'s withdrawal and does not announce anything,
   since its best path "3, 2, 1" is now longer than before,
   and Tony's (Geoff's?) PLD algorithm is invoked.

   (5)'s PLD timer is set to count down, lets say with a
   40 second time constant.  Absolute time of expiry will be
   40.2 secs.

   (5) changes its FIB to direct the packets to the next
   best path it knows, via (3).

   (3) gets (2)'s withdrawal too.  Since (3) has no other
    path, it announces a withdrawal to (5) and (4).


T = 0.3 seconds

   (5) gets a withdrawal from (3) and does not announce
   anything, since its best path is now longer than before:
   "4, 3, 2, 1".

   (5)'s PLD timer is reset and made to count down, lets say
   from 40 seconds.  Absolute time of expiry will be 40.3 secs.

   (5) changes its FIB to direct the packets to the next
   best path it knows, via (4).

   (4) gets (3)'s withdrawal too - and sends a withdrawal to
   (5).


T = 0.4 seconds

   (5) gets a withdrawal from (4).

   Because this eliminates the last path (5) had to (1):
   "4, 3, 2, 1", (5) needs to announce a withdrawal.

   It does so immediately, and disables the PLD timer.

   (5) configures its FIB to drop packets addressed to this
   prefix.


So with this PLD algorithm, as I understand it, there is only
one message sent by (5), and it is the correct one, sent at 0.4
seconds, not at 30.2 seconds as with current arrangements.

With PLD, (5) exhibits to its peers no false optimism about
delivering packets to (1), other than waiting for a very short
time (ensured by the PLD algorithm in the other routers, and by
the topology) to announce the withdrawal as soon as it was
certain it had was no path to (1).


Geoff wrote to the list:

> The withdrawal would propagate, or the path would stabilise.
> One way to  view it is as a more selective form of Min Route
> Advert Interval Timer.



Responding this and previous discussion of my message, Sriram wrote:

> My understanding is that MRAI does not apply to 
> explicit withdrawals (the AWs and WWs in your taxonomy).

However, my interpretation of the RFC:

  http://tools.ietf.org/html/rfc4271#section-9.2.1.1

is that the MinRouteAdvertisementIntervalTimer (MRAI Timer),
which defaults to 30 seconds (Section 10), applies to all
announcements, with no exemption for withdrawals.

  Two UPDATE messages sent by a BGP speaker to a peer that
  advertise feasible routes and/or withdrawal of unfeasible
  routes to some common set of destinations MUST be separated
  by at least MinRouteAdvertisementIntervalTimer.

So I think Sriram's understanding is incorrect.

> Did you happen to verify this based on the observed data?
> The reason I am asking is if this were implemented,
> then in your illustration of Fig. 1 (path hunting),
> all the withdraws (W) will make it quickly to Router 5.

I am not sure whether this last line refers to "this" being the
implementation of PLD or the way he thinks BGP routers already
behave, with the MRAI timer not stopping withdrawal messages.

However, the behavior Sriram goes on to describe accords with my
own interpretation, as described above, of how the routers would
behave with current BGP.


> BGP Router 5 may have sent one AA+ in response to the very first withdraw
> from Router 2 but any subsequent AA+ it tries to send will be
> held back for MRAI interval and during that time the
> other withdraws from Routers 3 and 4 will make it to Router 5
> (because they are not subjected to MRAI), 

So far, I agree with Sriram's interpretation.  I understand his
"not subjected to MRAI" as meaning that the withdrawal cascade
from (2) to (3) to (4) propagates rapidly (0.1 sec per step in
my explanation) because this is the first change regarding this
prefix in quite a long time.

However, I think Sriram believes a second, or perhaps the only
reason, the cascade is so quick is because he believes that the
MRAI timer does not apply to withdrawals.

> and hence Router 5
> will send only one AA+ and then a withdraw.

If the MRAI timer did not stop withdrawals, then the outcome
would be the same as for my understanding of what would happen
with Tony's PLD algorithm: (5) would propagate the withdrawal at
T = 0.4 seconds.

> If "not applying MRAI to explicit withdrawals" is implemented,
> then withdrawals propagate much faster than AA+ or AA0 updates, 
> hence path hunting related updates would be significantly reduced.
> I would think this is how MRAI was intended to expedite
> BGP convergence and reduce churn.

This may have been the intention, but to my understanding, the
MRAI timer delays withdrawals so they are announced at least 30
seconds after all previous announcements.  The highly meshed
topology of Geoff's diagram, which is typical of the Internet,
ensures that (5) first announces an AA+ and then waits 30
seconds before announcing the withdrawal.

Here is the summary of my current understanding.  Please correct
 it!

Geoff's diagrams don't give enough timing detail for me to
critique properly, but I think that with current BGP behavior,
the outcome for the rest of the BGP routers is really bad:

T =  0.2 sec       AA+ longer path announcement
T = 30.2 sec       Withdrawal.


Sriram seems to believe MRAI was developed with an intent to
achieve this, and thinks the RFC specifies that it should not
stop withdrawals.  But my understanding of the RFC is that MRAI
does delay withdrawals, in any circumstance like (5)'s, where it
has a next-best path which it announces first, because it hasn't
yet received a withdrawal from (3) or (4).

Tony, you co-wrote the RFC - what do you think?


   - Robin          http://www.firstpr.com.au/ip/ivip/


--
to unsubscribe send a message to rrg-request@psg.com with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg