
Re: [RRG] Opportunistic Topological Aggregation in the RIB->FIB Calculation?



On Mon, Jul 21, 2008 at 9:51 PM, Tony Li <tony.li@tony.li> wrote:
> http://www.iab.org/about/workshops/routingandaddressing/Router_Scalability.pdf

Tony,

Let's bound the problem a little better, shall we?

First, we're not worried about routine churn resulting from traffic
engineering changes and far-away failures. The current rate peaks at a
modest level; 100 times that rate is readily doable.

Second, we're not worried about large churn due to planned nearby link
changes, such as intentionally removing a link from service, restoring
a link to service or adding a new link. While it would be inconvenient
for such a change to take more than a few minutes, it could be managed
in a way that would not increase outages even if it took hours overall.

That excludes all but two problematic cases:

1. Catastrophic churn due to the failure of an attached link. A node
faced with this situation must quickly rebuild the FIB with new
entries for every route that used that exit. It must also quickly get
the word out to nearby nodes that it no longer offers as high-priority
a path to those routes and indeed may have no path at all.

2. Major churn on the few tens of "nearby" nodes whose topology
relative to the failed link requires changing many FIB entries. Note I
said FIB entries, not RIB entries. It doesn't matter how long it takes
to update the RIB with the new priorities except where those
priorities cause a different decision about what's placed in the FIB.

So, how can we address this within engineering and operations?

A. If the failed link is internal, MPLS recovers for us without
disrupting the BGP RIB.

B. If the failed link is external, an intentionally programmed tunnel
designated as a "backup" to the interface could usually let us keep
routing packets while the RIB recomputes optimal routes, the same
as MPLS allows internally. Recall, the FIB lookup returns an index
into an Exits table. If a tunnel link is designated as a full backup
and the tunnel encoding is implemented in the ASICs (as MPLS is), then
changing the exit in that one table is a trivial matter.
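
To make the point concrete, here's a minimal sketch in Python of why
the cutover is cheap. The table layouts and names are mine, invented
for illustration, not any vendor's data structures:

    # Hypothetical sketch: each FIB entry stores an index into a shared
    # Exits table, so failing over to a pre-built backup tunnel is a
    # single write to that table, not a per-prefix FIB rebuild.
    exits = [
        {"kind": "link",   "port": "eth0"},                # exit 0
        {"kind": "link",   "port": "eth1"},                # exit 1
        {"kind": "tunnel", "encap": "mpls", "label": 42},  # exit 2: backup
    ]

    fib = {"192.0.2.0/24": 1, "198.51.100.0/24": 1}  # prefix -> exit index

    def fail_over(failed_index, backup_index):
        # One write repairs every route that used the failed exit.
        exits[failed_index] = exits[backup_index]

    fail_over(1, 2)  # all prefixes pointing at exit 1 now use the tunnel

Every prefix that pointed at the failed exit follows the tunnel
immediately; the per-prefix FIB entries are never touched.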

C. Even if it isn't practical to build such a tunnel, you can
generally pick an alternate link that offers a high probability of
reachability for the impacted routes. On link failure, cut over to
the alternate while rebuilding the RIB and then the FIB.

D. Building on C, add basic loop detection to the forwarding
engine. If a router sees a packet arrive for a destination that it
would forward straight back to the neighbor that sent it, that's an
early indication that its route is faulty. It can (and probably
should) immediately demote that route below the next best one rather
than waiting for confirmation in a BGP update.
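
As a rough sketch of D, again with made-up structures rather than any
router's actual data path:

    # Hypothetical sketch of forwarding-plane loop detection: if we would
    # forward a packet straight back out the interface it arrived on,
    # treat that as early evidence the route is faulty and demote it now
    # instead of waiting for a BGP update.
    routes = {
        # prefix -> list of (priority, exit interface), best first
        "192.0.2.0/24": [(100, "eth1"), (50, "eth2")],
    }

    def forward(prefix, arrival_iface):
        candidates = routes[prefix]
        priority, iface = candidates[0]
        if iface == arrival_iface:
            # Probable loop: drop this route just below the next best one.
            next_best = candidates[1][0] if len(candidates) > 1 else 0
            candidates[0] = (next_best - 1, iface)
            candidates.sort(key=lambda c: c[0], reverse=True)
        return candidates[0][1]

    print(forward("192.0.2.0/24", "eth1"))  # demotes eth1, forwards via eth2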

None of this is perfect, but put together it enables a system that
isn't on the brink of collapse at 100 times the current number of
entries.

Let's talk about hardware next.

> In fact, the PC hardware doesn't actually do that.  What we really see is
> that DRAM memory speed grows at about 1.2X every two years and that our
> growth rate is at least 1.3X every two years.

That figure sounds fishy. As I recall, we were using a 100 MHz memory
bus in 1998 and moving to a 133 MHz memory bus. Today we're using a
1300 MHz memory bus and moving to a 1600 MHz memory bus. Math may not
be my first love, but I'm pretty sure doubling every 3 years (like I
said) would give us 800 MHz in 9 years, 1600 MHz in 12 years and
somewhere between the two in 10 years.
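
A quick sanity check of that arithmetic:

    # Doubling every 3 years from a 100 MHz bus in 1998:
    for years in (9, 10, 12):
        print(1998 + years, round(100 * 2 ** (years / 3)), "MHz")
    # 2007  800 MHz
    # 2008 1008 MHz
    # 2010 1600 MHz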

Granted, the speed of the individual DRAM cells themselves isn't
increasing quickly, but what matters is the aggregate speed with which
a range of memory becomes accessible to the CPU (a cache fill).

Moving on: for a major recomputation of the RIB (which we decided was
the problem case), the recomputation can be performed in a
cache-efficient manner. When we know there's a big change, we don't
have to use the same algorithm we use for the one-off changes of
routine churn. This means that for the most part we'll be handling the
problem case at SRAM speeds rather than DRAM speeds.
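
As an illustration of the kind of batching I mean (my own sketch, not
a production algorithm): instead of chasing pointers one update at a
time, a bulk event can walk the affected entries in memory order so
each cache line is filled once and fully used:

    # Illustrative only: visit affected RIB entries in index (memory)
    # order so the scan is sequential and cache-friendly rather than
    # the pointer-chasing of the routine-churn path.
    rib = [
        {"prefix": "10.0.0.0/8",
         "paths": [{"via": "A", "cost": 2}, {"via": "B", "cost": 1}]},
        {"prefix": "10.1.0.0/16",
         "paths": [{"via": "A", "cost": 1}]},
    ]

    def bulk_recompute(rib, affected):
        for i in sorted(affected):
            rib[i]["best"] = min(rib[i]["paths"], key=lambda p: p["cost"])

    bulk_recompute(rib, {1, 0})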

The task can also be broken up and performed in parallel. This means
that IF memory access can be done in parallel, the task can be
significantly sped up.

Is there any cheap COTS hardware that implements such parallel memory
access? YES, THERE IS.

AMD Opteron processors embed the memory controller in the CPU. Each
CPU manages its own bank of memory over a dedicated memory bus, and
the CPUs share data with each other via HyperTransport links. If the
portion of the RIB associated with part of the address space is
intentionally placed in memory managed by the CPU that will calculate
that portion of the RIB, then the computation can proceed in parallel
without contention on the memory bus.

2-way Opteron systems are commonplace, while 8-way systems are
findable with a Google search. That's an instant 8x boost in
catastrophic RIB recalculation speed with COTS hardware, just by
parallelizing the algorithm and managing memory a little better.
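
As a sketch of that partitioning, with Python's multiprocessing
standing in for NUMA-pinned workers (in a real box each worker and the
RIB shard it owns would be pinned to one CPU and its local memory
bank):

    # Illustrative sketch: shard the RIB by address range and recompute
    # the shards in parallel so the workers never contend for a single
    # memory bus.
    from multiprocessing import Pool

    def recompute_shard(shard):
        # Recompute the best path for every entry in this shard.
        return [min(entry["paths"], key=lambda p: p["cost"])
                for entry in shard]

    def parallel_recompute(shards, workers=8):
        with Pool(workers) as pool:
            return pool.map(recompute_shard, shards)

    if __name__ == "__main__":
        shards = [[{"paths": [{"via": "A", "cost": i}]}] for i in range(8)]
        print(parallel_recompute(shards))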


In summary:

* An 8x boost over your top capacity estimate is cost-effective and
immediately available in existing COTS hardware.
* We only have to solve the nearby link-failure case to sustain BGP
beyond that.
* Tunnels let us avoid outages from the nearby link-failure case most
of the time.
* Other algorithmic tweaks, none of which change the protocol,
drastically reduce the outage associated with the nearby link-failure
case.

We'll still eventually hit a wall on the BGP RIB, but the cost won't
reach an untenable level until after we've passed 100x the current
size. We'll hit the wall on the FIB much sooner, perhaps as little as
10x.


> Those matters, since they are implementation details that actually have
> private solutions known to some in the industry are really not our
> balliwick.  Again, our charter is to address the routing and addressing
> architectural issues that cause a lack of aggregation in the first place.

I'll have to think about that one further but I'm sure it has an
appropriate answer as well.


On Tue, Jul 22, 2008 at 5:49 AM, Iljitsch van Beijnum
<iljitsch@muada.com> wrote:
> 100 times the entries * 100 times the churn = 10000 times the processing.
> I'm afraid that your DRAM isn't going to keep up with that in a traditional
> design.

You have the combinatorics wrong. Each entry has some probability of
churning each second. So if you have 100 entries, you're 100 times as
likely to see a single-entry churn event. You are NOT 100 times as
likely to see a 100-entry churn event. In fact, you're no more likely
to see a full-table churn event than you were when there was only 1
entry, and each such full-table churn consumes only 100 times the
processing.

The net result is that 100 times the entries ~= 100 times the churn ~=
100 times the processing. NOT 10,000.
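
The arithmetic, worked as a sketch with illustrative numbers:

    # N independent entries, each churning with probability p per second,
    # generate an expected N*p single-entry updates per second: linear in
    # N. The chance that ALL N churn in the same second is p**N, which
    # shrinks as N grows.
    N, p = 100, 0.01
    expected_updates_per_sec = N * p  # 1.0: 100x the single-entry case
    full_table_event_prob = p ** N    # 1e-200: vanishingly rare
    print(expected_updates_per_sec, full_table_event_prob)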

Regards,
Bill Herrin


-- 
William D. Herrin ................ herrin@dirtside.com bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
