
Re: A new spin on multihoming: multihoming classes.



On Sat, 8 Sep 2001, Michael Richardson wrote:

> 
> >>>>> "Peter" == Peter Tattam <peter@jazz-1.trumpet.com.au> writes:
>     Peter> On Fri, 7 Sep 2001, Geoff Huston wrote:
>     >> But as to the assertion that 8,192 is some magical preferred number of
>     >> 'root prefixes', then I'm not sure that I can agree.
> 
>     Peter> I am concerned that the whole multihoming issue hinges on the
>     Peter> answer to whether
>     Peter> BGP can be made to work with larger DFZ than we anticipated.
> 
>     Peter> I'm not sure where 8K comes from.  8K = 2^13.
> 
>     Peter> Maybe it relates to the size of a practical switching table in very fast
>     Peter> routers built using ASICs.
> 
>   No.
>   Currently shipping OC-48 capable ASICs do 100K+ easily (for IPv4).
>   Announced OC-192 ASICs do the same or more, and most designs have roadmaps
> to OC-768, assuming that the OIF/NPF people finish defining appropriate
> electrical interfaces to get packets between ASICs at that rate :-)
> 
>   Some solutions scale as the number of bits that matter. I.e. a single /128
> entry simply takes twice as much space as two /64s, while other solutions
> must store as many bits as the worst case length. The pathological situation
> (highest price, highest power dissipation) is ternary CAMs which seem to do
> 72 bits or 144 bits only as options.
> 

OK, point taken.  That flies in the face of the RFC recommendations mentioned
in other docs, though.
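The quoted point about storage scaling can be made concrete with a little
arithmetic.  A toy model only: the 72/144-bit slot widths are taken from the
message above, while the overhead-free cost model and the example prefix
lengths are my own assumptions.

```python
# Toy comparison of two lookup-table storage models for IPv6 prefixes:
#  - a trie-like structure whose cost scales with the bits that matter,
#  - a ternary CAM that must allocate a fixed-width slot per entry
#    (72 or 144 bits, per the quoted message).
# Purely illustrative; real hardware has per-entry overheads not modelled here.

def trie_bits(prefix_lengths):
    """Storage proportional to the sum of significant prefix bits."""
    return sum(prefix_lengths)

def tcam_bits(prefix_lengths, slot_widths=(72, 144)):
    """Each entry occupies the smallest fixed slot wide enough to hold it."""
    return sum(min(w for w in slot_widths if w >= plen)
               for plen in prefix_lengths)

# A single /128 costs the same as two /64s under both models...
print(trie_bits([128]), trie_bits([64, 64]))  # 128 128
print(tcam_bits([128]), tcam_bits([64, 64]))  # 144 144

# ...but short prefixes are where the fixed-slot TCAM model pays the
# "pathological" price: a /16 still burns a whole 72-bit slot.
print(trie_bits([16, 24, 128]))  # 168
print(tcam_bits([16, 24, 128]))  # 288
```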

So we have the technology to make large routing crossbars.  I guess the issue
is what's feeding those crossbars.  Am I right in guessing that a lot of the
BGP processing happens as a background process in the routers, and that it's
managing that BGP mesh in a stable manner which is at issue?  Can one assume
that the technology will keep pace with the necessary expansion of the
internet?

At issue is the big question of whether BGP will scale *across the whole
network*.  I haven't seen a clear, informed answer to this yet.  Many of the
studies I've seen rely only on empirical measurement and analysis of the
current v4 network, with predictions of what is likely in the next year or so.

One problem that I see (correct me if I'm wrong) with the current v4 BGP system
is that *IF* there is a BGP storm, it affects *everyone*, multihomed or not.
In other words, even sites which aren't multihomed are penalized.  Stability of
the core is then threatened merely by the act of allowing multihoming via long
prefixes.

It would be a very worthy postgrad thesis to do a realistic simulation of the
BGP system to see exactly what happens when it is scaled up in size - there
must be one around, I'm sure, but it might need a supercomputer to run it.

I guess it comes down to fundamentals.  For a router based solution, you need
to fill the DFZ with all the possible MH information for every MH site in the
world, and intuitively we think this can't possibly scale.  To date the ratio
of MH sites to non-MH sites has been low - kept that way by availability and
cost.  However, with the advent of widespread always-on connectivity and
bandwidth (e.g. cable, DSL, 802.11), the availability of MH is likely to
increase that ratio faster than would have been initially predicted.  The
concern is that even if a core router could switch fast enough, with what seems
a reasonable number of routing table entries, the overengineering required for
a given router lifetime is difficult to predict under the current BGP based MH
solution.
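The concern above can be put as back-of-envelope arithmetic.  Every number here
is assumed purely for illustration (site count, prefixes per MH site); the
point is only that DFZ table size tracks the MH fraction linearly, so a jump in
that fraction moves the table size by the same factor.

```python
# Back-of-envelope sketch (all numbers assumed, purely illustrative):
# if each multihomed site must inject its own long prefixes into the DFZ,
# the table size grows in direct proportion to the MH fraction of sites.

def dfz_entries(total_sites, mh_fraction, prefixes_per_mh=2):
    """Extra DFZ routes implied by multihoming; non-MH sites are assumed
    covered by their provider's aggregate and contribute nothing."""
    return int(total_sites * mh_fraction * prefixes_per_mh)

sites = 1_000_000  # assumed number of edge sites
for frac in (0.01, 0.05, 0.20):
    print(f"MH fraction {frac:4.0%}: ~{dfz_entries(sites, frac):,} DFZ routes")
```

A router provisioned for the 1% case is off by a factor of twenty if always-on
connectivity pushes the ratio to 20% within its deployed lifetime.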

The suggested IPv6 solution is to limit the choices that the DFZ routers need
to make and push the MH decision making to the edges.  By doing so, you throw
away the MH information visible at the core and you have to regain it another
way (which I've said before).

One thing we do know scales reasonably well is DNS.  It has the job of
providing name-to-address mapping for almost every address on the planet, and
it does it by virtue of localization of information and caching.  It has a few
bottlenecks, I suspect, where there are heavily loaded domains (e.g. .com),
but it survives because it can be run on hardware which is cheaper than
routers, and where memory cost is, I suspect, falling at a much faster rate
than router memory and ASICs.
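The caching behaviour that lets DNS scale can be sketched in a few lines.  This
is a minimal model, not real resolver code; the name, address, and TTL below
are invented for illustration.

```python
import time

# Minimal sketch of DNS-style caching: answers are held locally for their
# TTL, so repeated lookups are served without touching upstream servers.
# This localization is the property the author credits for DNS scalability.

class TtlCache:
    def __init__(self):
        self._store = {}  # name -> (expiry_time, value)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(name)
        if entry and entry[0] > now:
            return entry[1]   # cache hit: answered locally
        return None           # miss or expired: must query upstream

    def put(self, name, value, ttl, now=None):
        now = time.time() if now is None else now
        self._store[name] = (now + ttl, value)

cache = TtlCache()
cache.put("www.example.com", "192.0.2.1", ttl=300, now=0)
print(cache.get("www.example.com", now=100))  # 192.0.2.1 (served locally)
print(cache.get("www.example.com", now=400))  # None (TTL expired)
```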

So I can only think that to achieve the scalability we want, we still need to
work on decentralizing the MH decision making, no matter how confident we are
that core routers are going to cope.

Finally, I have to concede there is one fundamental flaw in the host based or
DNS based solutions to MH, as was eloquently pointed out by someone else: DNS
needs routing, so routing can't depend on DNS.  That means we can't use DNS as
it is currently implemented, primarily because it is agnostic towards routing
conditions.

This implies that you need a globally reachable subsystem, independent of MH
conditions, which then supplies the MH information needed for any site based
or transport based MH solution to function.  Very likely it could borrow
heavily from DNS, but it would need to be redesigned to be resilient to MH
conditions, which I believe DNS is not.

My final conclusion is that we need a reachability caching system somewhat like
DNS, but one that is reachability aware.  Traditionally BGP has met this need,
but that places the burden of managing it on the routers.

An idea springs to mind - why not have an alternative system, based on BGP but
*independent* of the routing system, from which reachability information can
be gathered?  Such a BGP system would need to be extended so that the whole
DFZ would not need to be kept in the BGP server - rather, just the localized
routing needs of the site, much in the same way that a DNS server only keeps
the localized name mappings it requires for the site's current connections.

Such a system would meet the needs of keeping the *DFZ of core routers* small,
provide information for load balancing, retain a strong aggregation structure,
and reduce instability of the core by forbidding the advertisement of long
prefixes.  I believe it could run on hardware comparable to that which DNS
uses.  Such a service could be referenced by node based multihoming solutions
as an accurate and timely source of information.
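One possible shape for such a reachability cache, sketched under heavy
assumptions: all class names, prefixes, and provider labels below are
invented, and this is a data-structure sketch, not a protocol design.  The
idea it illustrates is only what's described above - holding liveness
information for just the destinations the local site currently talks to, so an
edge node can pick among its upstreams.

```python
# Hedged sketch of the proposed per-site reachability cache: like a DNS
# resolver's cache, it holds only locally relevant entries, but the data it
# caches is reachability (which upstream paths to a destination currently
# work) rather than name-to-address mappings.

class ReachabilityCache:
    def __init__(self):
        # destination prefix -> {upstream provider: currently reachable?}
        self._routes = {}

    def learn(self, prefix, provider, reachable):
        """Record liveness of one upstream path toward a destination."""
        self._routes.setdefault(prefix, {})[provider] = reachable

    def usable_paths(self, prefix):
        """Providers through which the destination is currently reachable;
        a multihomed edge node would choose its exit (and source address)
        from this list."""
        return sorted(p for p, ok in self._routes.get(prefix, {}).items() if ok)

cache = ReachabilityCache()
cache.learn("2001:db8:1::/48", "isp-a", True)
cache.learn("2001:db8:1::/48", "isp-b", False)  # isp-b path is down
print(cache.usable_paths("2001:db8:1::/48"))    # ['isp-a']
```

The core never sees the long prefixes; only the edge nodes consult the cache.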

If this has been thought of before I completely apologize to the original
proposer.

Peter

-- 
Peter R. Tattam peter@trumpet.com Managing Director, Trumpet Software
International Pty Ltd Hobart, Australia, Ph. +61-3-6245-0220, Fax
+61-3-62450210