Robin,
Be assured that TARA can be done such that each single next hop, no matter
whether intra- or inter-domain, requires just one single table lookup.
75,600 table entries are sufficient to cope with a density where two
routers are within half a yard of each other (IMO 72,000 should be
sufficient too).
I have never heard a convincing argument why intra- and inter-domain
routing have to be as orthogonal as they are today.
TARA's routing table could be derived from viewing a topology in which each
router is immediately surrounded by a strict network, followed by loose and
looser networks according to geographical remoteness, PLUS (!!!) an
extension of the strict network, which is the entire intra-domain network
without geographical limitations.
I would be grateful if you could provide an estimate of how much faster a
packet could be forwarded along an average (intra- and inter-domain) path
by means of a single table offset rather than a binary search across a
classical FIB at each router.
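As a very rough illustration of the question (the structures below are
invented for the sketch - neither TARA's table nor a real FIB is organised
this way), a table-offset lookup is a single constant-time index, while a
classical FIB lookup behaves more like a binary search over sorted prefix
boundaries:

```python
import bisect

# Hypothetical flat table: one entry per destination index (O(1) offset).
FLAT_TABLE = [f"hop-{i % 16}" for i in range(75600)]

def offset_lookup(dest_index):
    # One offset into a contiguous table, regardless of table size.
    return FLAT_TABLE[dest_index]

# Hypothetical classical FIB: sorted prefix start points, binary-searched.
PREFIX_STARTS = list(range(0, 75600, 100))   # 756 invented "prefixes"
NEXT_HOPS = [f"hop-{i % 16}" for i in range(len(PREFIX_STARTS))]

def fib_lookup(dest_index):
    # O(log n) search for the covering prefix, then one more read.
    pos = bisect.bisect_right(PREFIX_STARTS, dest_index) - 1
    return NEXT_HOPS[pos]
```

Any real estimate would depend on memory hierarchy far more than on
instruction counts, which is why a measurement such as Heiner requests
would be worth more than this sketch.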
Heiner
In an email of 30.09.2008 09:47:15 Western European Standard Time,
rw@firstpr.com.au writes:
Short version: To what extent do currently proposed solutions for the DFZ
routing scaling problem help with scaling problems in large internal
networks? Not at all, as far as I can tell.
I guess the scaling problems of internal networks are of quite a different
nature to the problems faced by BGP routers in the DFZ.
I explore how Ivip's encapsulation and two Forwarding approaches could be
used to help with internal routing systems, in addition to or independently
of their use in the DFZ.
The Forwarding approaches have no encapsulation overhead - overhead which
is a significant problem for IPv6 map-encap, especially for VoIP packets.
They are 100% efficient and involve no PMTUD problems.
In "Re: 2 billion IP cellphones in 2103 & mass adoption of IPv6 by current
IPv4 users" http://psg.com/lists/rrg/2008/msg02594.html Wesley George wrote
about the scaling problem in internal networks, wondering whether the
solution we are seeking for the DFZ scaling problem will help with this.
Referring initially to the scaling problem in the IPv6 DFZ developing later
than the scaling problem of internal networks of 3G cellphone operators, he
wrote:
> Admittedly the problem may be further off in the DFZ. However, I
> don't know why we would design something that only applies to the
> DFZ, since the route scale problem has potential to be much worse
> within a given network than outside it.
I wonder how many ISP and end-user networks have more internal routes than
the DFZ's 260k?
I understand that internal routing systems use OSPF or IS-IS, which operate
on completely different principles to BGP, which is entirely decentralised.
The scaling properties of these internal systems are presumably very
different from BGP's. I think it is true to say that BGP suits the
interdomain core better than OSPF or IS-IS, because BGP can work fine in a
system with no central coordination, even when the size of the network is
not known, while the other two assume a centrally administered and
carefully managed network.
I understand that in terms of the forwarding (FIB) part of the router,
every extra route is another load on the system like any other. I
understand that the FIB doesn't distinguish between routes which come from
BGP and those from the IGP (OSPF or IS-IS). Despite the heroics necessary
to classify and forward packets at 1 and 10 Gbps for millions of prefixes,
I understand this is not the primary aspect of the routing scaling problem.
The RIB control plane scaling problems affecting any one BGP router
include:
1 - Amount of RAM and CPU power required to handle a given number of
prefixes. This scales directly with the number of prefixes multiplied by
the number of neighbour routers, since the router needs to conduct a
separate conversation with each neighbour about each prefix.
2 - Traffic requirements for the routing protocol. I figure this is a
relatively minor concern.
3 - Difficulties with the CPU and RAM coping with large floods of changed
routes, such as when one nearby or distant link or router goes down, or
comes up, affecting hundreds of thousands of prefixes.
4 - Point 3 manifests in the router as excessive delays in adjusting its
best paths to the changed conditions.
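Point 1 can be put in back-of-envelope form (the 100 bytes per stored path
is an invented illustrative figure; real per-path memory cost varies widely
between implementations):

```python
def rib_memory_bytes(prefixes, neighbours, bytes_per_path=100):
    # The RIB holds roughly one path per prefix per neighbour, so
    # memory grows with the product of the two.
    return prefixes * neighbours * bytes_per_path

# 260k DFZ prefixes and 30 BGP neighbours at ~100 bytes per path:
print(rib_memory_bytes(260_000, 30))   # 780000000, i.e. ~0.78 GB
```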
As the number of prefixes rises, and as the rate of updates rises, in order
to keep to certain standards of correct and rapid operation, each router
needs to be upgraded at great expense, either with more RAM or perhaps with
a complete replacement. Alternatively, any given router may need to be
restricted to operating with fewer neighbours than it would be if the
number of prefixes had not grown so much.
So these points add up to a major financial burden for any network
operating DFZ routers.
Also, the scaling problems for the whole BGP network include:
5 - Slower propagation of updates - slower response to outages and
therefore more packets going to black holes during a major change to the
topology of the network.
6 - Greater concern about the stability and robustness of the whole
network, considering that no-one really understands it. No-one even knows
for sure the structure of the network or how many DFZ and single-homed BGP
routers there are. How is the overall behaviour affected by some routers
passing on changes much slower than others? It is hard to estimate, but in
general it can't be good.
I understand that some networks have millions of internal routes. I assume
this is sustainable - so OSPF or IS-IS presumably scales somewhat better
than BGP. Part of this ability to handle larger numbers of prefixes is
probably due to the internal routing system being more controlled than the
DFZ. In the DFZ, there is no control or influence on the rate of updates
arriving from neighbouring networks. Other than crude filtering to the
point of ignoring some of them, and potentially upsetting connectivity to
some parts of the Net, a DFZ router needs to respond to them all.
In these very large internal networks, what are all these routes for? Is
it all internal stuff for the ISP/carrier? Does it carry a large number of
routes for PA customer networks? This is probably too big a question to
answer in the RRG. Can anyone point me to resources concerning this?
I am not convinced that the internal network's scaling problems are
identical or even close to those of the DFZ. Even if they were identical,
I would argue that they are not as much of a concern to us as the DFZ's:
Firstly, it is a conscious decision by administrators to make a network so
big that it has a million or more internal routes. No-one is forcing them
to do this, or saying it is a good idea.
Only a small proportion of ISP and end-user networks have such numbers of
internal routes, and it is probably a fairly low priority for the IETF to
reduce the costs of such large organisations.
Secondly, since these internal networks are fully managed by the
organisation, including presumably the ability to reduce the updates sent
by any one router, the scaling problems and stability difficulties can be
controlled in this way, rather than by changing protocols or spending more
money on routers.
Since any substantial ISP must deploy routers in the DFZ, the costs of
doing so, and the instability and long convergence time problems resulting
from the growing size of the DFZ routing table, are major barriers to any
ISP operating. Therefore, the DFZ scaling problem has a pervasive impact
on the cost and quality of all Internet communications. The same goes for
any end-user network which wants or needs portability and multihoming.
This is a very high priority for the IETF.
Even if we accept that the scaling problems of internal networks are very
different from those of BGP in the DFZ, and even if we decide it is not our
concern if internal routing systems have scaling problems, we might still
want to consider how our proposed routing scaling solution would help or
otherwise with the internal routing scaling problem. Since we need to
convince ISPs, both large and small, to invest in changes to routing and
addressing, this looks like an important question:
> What incentive do I have as an operator to deploy some fantastic
> new thing for the DFZ if I still have to have routers that cost
> millions to deal with my internal network routing table?
In the APT business model, as described in a recent message from Michael
Meisel:
http://psg.com/lists/rrg/2008/msg02589.html
the decision to adopt APT is taken by the ISP, for the ISP's own immediate
and lasting benefit: improved efficiency in some way. (I don't know how
this is achieved, and I don't understand how an ISP could do this without
checking with the end-user network whose space is being converted to an APT
EID and so will be withdrawn sooner or later from the DFZ.)
In the Ivip model:
Re: Comparing APT & Ivip - new business models
http://psg.com/lists/rrg/2008/msg02593.html
ISPs are not necessarily the primary driving force behind Ivip adoption.
They will be the most direct beneficiary of Ivip's effect of reducing or
eliminating the routing scaling problem - and their lower costs will be
passed on to all Internet users.
However, on a network-by-network basis, the impetus for adopting
Ivip-managed SPI address space, or for converting an existing PI prefix to
SPI space, will mainly come from the end-user networks whose space this is.
Existing PI end-user networks could either gain better flexibility (many
more micronets, including down to single IPv4 addresses, with potentially
fast and frequent mapping changes to implement real-time load sharing) by
converting their prefix to SPI space; or, by relinquishing their PI space
and BGP expertise and renting a probably smaller, and therefore probably
cheaper, amount of SPI space from a MAB operating company, they could
achieve improved flexibility and reduced costs.
End-user networks with PA space will be motivated to adopt Ivip by the
desire for portable, multihomeable space - making them independent of any
one particular ISP. Their current ISP is unlikely to push them to adopt
Ivip, except in the hope of keeping them as customers, rather than them
doing so on their own, or at the urging of a competing ISP.
The direct benefit of Ivip to ISPs comes slowly, as the DFZ routing table
either drops in size, or at least doesn't grow as fast as it otherwise
would have. ISPs may in general want every end-user network to adopt Ivip,
for this reason. However, any PA end-user customer of theirs which adopts
Ivip will be less tied to this ISP than before, because they can now
multihome their new SPI space with another ISP, or leave the current ISP
entirely.
So ISPs might in the short term be unmotivated to deploy Ivip themselves,
except as required to meet the needs of their customers who want to use it.
An ISP's interests might be served well by letting all other ISPs and
end-user networks adopt Ivip, while itself doing nothing and keeping its
current PA customers. However, competition ensures that such a complacent
approach would lead to loss of customers.
> Assuming that always-on IP-enabled applications continue taking
> off, I have ~55M handsets to address. Accepting in the short term
> (5 yrs or so) there will be some significant amount of IPv4-only
> devices, as those age out, the IPv6 table continues to grow in my
> network. The DFZ may not have to see much of that except in some
> mobility cases (depends on implementation), but you can't argue
> with the idea that even with well-built address hierarchies, some
> routers in the network are going to have to deal with orders of
> magnitude more routes than they do today. What better place to
> test out a new scalable routing infrastructure than in a
> controllable network before it has to be implemented by the DFZ
> across multiple networks?
The RRG's charter is purely to deal with the scaling problem in the DFZ.
As currently presented, the core-edge separation schemes - LISP, APT, Ivip,
TRRP and Six/One Router - are all intended to relieve pressure on the DFZ
core by enabling end-user networks to have a new kind of PI space, which I
call Scalable PI space, without each SPI prefix appearing in the DFZ.
In a recent message:
Re: Comparing APT & Ivip
http://psg.com/lists/rrg/2008/msg02589.html
Michael Meisel wrote of APT:
> Below, you describe your doubts about how a single ISP could
> deploy APT unilaterally, without the involvement of their
> customers. Allowing for this, and giving ISPs an incentive to do
> so, is perhaps *the* primary goal of our incremental deployment
> scheme. As I mentioned before, we should have a new document
> describing the updated details sometime in the next few months.
> But, to summarize, you can think of a single ISP deploying APT as
> similar (in concept) to a single ISP deploying MPLS, or some other
> internal efficiency improvement. The difference is, APT allows for
> a potential increase in benefits with every other ISP that deploys
> it.
As I wrote in that thread, I don't understand how APT could be deployed by
an ISP without coordinating with the end-user network. However, Michael
indicates that APT could improve efficiency within the ISP network. If so,
then perhaps it helps in some way with the scaling problem of the internal
network, perhaps in terms of the number of BGP routes its internal BGP
system carries.
Ivip, as currently described, does not aim to help with the scaling
problems of large internal networks. Here I will explore how Ivip might be
used to do this.
While Ivip began as a map-encap scheme - like LISP, APT and TRRP - the long
term goal for Ivip is to use Forwarding, instead of encapsulation:
ETR Address Forwarding (EAF) - for IPv4
http://tools.ietf.org/html/draft-whittle-ivip4-etr-addr-forw-01
Prefix Label Forwarding (PLF) - for IPv6
http://www.firstpr.com.au/ip/ivip/ivip6/
These both have two major advantages - no encapsulation overhead, and no
need for greater ITR and ETR complexity with extra protocols and probing
etc. to solve the Path MTU Discovery (PMTUD) problems inherent in
map-encap:
http://www.firstpr.com.au/ip/ivip/pmtud-frag/
These two Forwarding schemes operate in different ways. Both make use of
currently unused, or little-used, bits in the existing IPv4 and IPv6
headers. They both require upgraded FIB functions in DFZ routers, and to
some extent in internal routers. The PLF approach for IPv6 also requires a
small change to the RIB. Neither involves new routing protocols or any
change to the BGP (or internal routing protocol) implementation.
Perhaps it will be easiest to implement these changes in routers and then
deploy Ivip purely on a forwarding basis. Otherwise, we need to devise and
introduce the more complex map-encap approach - and upgrade progressively
to Forwarding in the longer-term future.
Here is how Ivip might be used to help with the scaling problems of
internal networks.
Firstly, consider IPv4 and IPv6 done purely with encapsulation: this
requires no changes to DFZ or internal routers, so it can be deployed by
adding ITRs, ETRs and a mapping system.
I will consider the potential for helping with the scaling problems of both
large ISPs / telco-mobile carriers and big end-user organisations, such as
large universities, governments and corporations. I will refer to these as
Big networks.
In the standard Ivip arrangement, the Big network has multiple full
database query servers (QSDs). These all receive a full, real-time flow of
mapping updates from the global fast-push mapping system:
http://tools.ietf.org/html/draft-whittle-ivip-db-fast-push
ITRs in the Big network send queries to these QSDs and get mapping replies
very quickly and reliably (like APT's ITRs and Default Mappers, and much
faster and more reliably than with LISP-ALT's or TRRP's global distribution
of potentially millions of query servers).
ETRs can be located on any conventionally BGP-managed address - not on SPI
addresses. This means that for a Big end-user network to have its own
internal ETRs, it must have some part of its network running with either
its own conventional PI address space, or perhaps with some PA space it
gets from one or more ISPs. For instance, a multihomed end-user network
could have ETRs in its network, on two separate pieces of address space,
one from each of its two upstream ISPs.
ITRs can be on any public address - conventional BGP-managed or SPI. They
can't ordinarily be behind NAT, since they need to be able to receive
mapping updates from a nearby QSD for some prefix they recently requested
mapping for.
Here is how such a Big network could use Ivip to reduce the number of
prefixes in its internal routing table. The Big network would establish
its own internal mapping system, to generate mapping for internal
micronets, and to map them to any of its internal ETRs. Actually, each
internal micronet could be mapped to any ETR in the world - it's just that
the system would only catch packets sent to these micronets from within
this Big network, or from within any other network which also used this
second set of internal mapping information to drive its QSDs and ITRs.
The QSDs would all be sent this internal mapping information in addition to
the global feed, as sent to all QSDs everywhere.
Caching ITRs (and any full database ITRs, which are effectively caching
ITRs coupled directly to a QSD) would then be able to encapsulate packets
sent within the internal routing system and tunnel them to whatever ETR was
specified in the mapping.
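A minimal sketch of that lookup-and-encapsulate step (the micronets, ETR
addresses and function names here are invented for illustration; a real
ITR would cache mapping replies and handle cache misses asynchronously):

```python
import ipaddress

# Internal mapping, as a QSD might hold it: micronet -> ETR address.
# Both entries below are invented examples using private space.
INTERNAL_MAPPING = {
    ipaddress.ip_network("10.1.2.0/28"): "10.9.9.1",
    ipaddress.ip_network("10.1.2.16/29"): "10.9.9.2",
}

def map_lookup(dest):
    """Return the ETR for dest's micronet, or None if not mapped."""
    addr = ipaddress.ip_address(dest)
    for micronet, etr in INTERNAL_MAPPING.items():
        if addr in micronet:
            return etr
    return None

def itr_handle(dest):
    etr = map_lookup(dest)
    if etr is None:
        return f"forward {dest} normally"
    # Map-encap: tunnel the original packet to the ETR.
    return f"encapsulate {dest} -> tunnel to ETR {etr}"
```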
With an internal fast-push mapping system - without the need for multiple
RUASes or the Launch server system - this mapping could probably be pushed
to all internal QSDs in a fraction of a second. I don't know how
responsive OSPF or IS-IS is in large networks, but perhaps this internal
Ivip mapping system would be more responsive than these internal routing
systems. ITR behaviour could probably be changed in less than a second.
Assuming that most routers remain as they are now, and that ITRs are either
hardware-based routers (Cisco and Juniper style) or specially programmed
COTS hosts, there needs to be a way to attract raw packets to these ITRs
when their destination address matches the prefix of an Internal MAB
(Mapped Address Block) of Big network Ivip-managed space. (I probably need
another term for this other than SPI space.)
So we need the concept of an IMAB in addition to the global MABs which the
global Ivip system manages. These ITRs would be an internal equivalent of
the Open ITRs in the DFZ (OITRDs) - maybe call them Open Internal ITRs
(OIITRs) or Open ITRs in the Internal network (OITRIs). These are all
tongue-twisters . . .
For internal purposes only, ITRs and ETRs could probably be on private
addresses too. Likewise, private address space could be managed by this
internal Ivip system.
In this way, the internal routing system can handle a much smaller number
of prefixes, since most of the prefixes the internal routing system
currently handles could be done with the internal mapping system and
internal ITRs and ETRs.
The destination ETR for an internal micronet doesn't have to be an internal
ETR. It could be any ETR in the world. Maybe this could help link the
network to other networks. This is getting complex, but these are optional
complexities, and this should be expected in any flexible, useful, TCP/IP
routing and addressing scheme.
The PMTUD problems inherent in map-encap may be significantly reduced in
this internal application of Ivip, because the network administrators may
be able to ensure that all devices between the ITRs and the ETR have MTUs
of 9000 bytes or so. That doesn't necessarily solve PMTU problems for an
application which thinks it can send packets of this length, when those
packets are encapsulated and become longer than 9000 bytes - and the
encapsulation makes a mess of a Packet Too Big message which is created by
a router in the ITR -> ETR path. However, there may be ways of handling
this in an internal network which are simpler than those required in
interdomain routing. For instance, every ITR could reject a packet of
length > (9000 minus encapsulation overhead), if it can be assured that the
MTU to the ETR, which is known to be in the Big network, is exactly 9000.
Upgrading everything to Gigabit Ethernet and 9000-byte MTU would be quite a
task, but over a few years, as part of the upgrade to an internal Ivip
system, the whole network could be significantly streamlined.
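The ITR-side length check suggested above might look like this (the
9000-byte internal MTU and the 20-byte IP-in-IP overhead are assumptions
made for the sketch, not figures from the Ivip drafts):

```python
INTERNAL_MTU = 9000   # assumed uniform MTU on every ITR -> ETR path
ENCAP_OVERHEAD = 20   # assumed outer IP-in-IP header size

def itr_accepts(packet_len):
    # Reject any packet that would exceed the internal MTU once the
    # encapsulation header is added, so no Packet Too Big message can
    # ever be generated inside the ITR -> ETR tunnel.
    return packet_len + ENCAP_OVERHEAD <= INTERNAL_MTU

print(itr_accepts(8980))   # True: fits exactly once encapsulated
print(itr_accepts(8981))   # False: one byte too long
```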
Now consider Ivip operating with ETR Address Forwarding (EAF) for IPv4.
The internal routing system would need to have most or all of its routers
upgraded to forward packets according to the 30 available bits in the IPv4
header when the header is in the new format, signified by the "Evil bit"
being set to 1.
Maybe some smaller, older routers near the periphery of the network are not
upgraded in this way. So within the network there is a "Forwarding
Upgraded Zone for IPv4" (FUZv4) where Forwarding-based ITRs and ETRs can
operate freely, without any PMTUD problems. This means they have 1500- or
9000-byte (or whatever) PMTUs and handle traffic packets of these lengths,
with no encapsulation overhead. PMTUD would operate normally with these
routers, including when they are between the ITR and the ETR - the sending
host's application adjusts the packet length to suit the total PMTU between
it and the destination host.
ITRs can have their packets forwarded to any ETR, and the ETRs must be
located on addresses such as x.x.x.0, x.x.x.4, x.x.x.8 etc., due to the use
of only 30 bits for a forwarding address, rather than the ideal of 32 bits.
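The alignment requirement follows directly from dropping the two low bits.
A sketch of the encoding (the helper names are mine; the idea of carrying
30 bits of ETR address in the header is per the EAF draft):

```python
import ipaddress

def etr_to_30bit(etr):
    """Encode an ETR's IPv4 address into a 30-bit header field.

    Only addresses divisible by 4 (x.x.x.0, .4, .8, ...) survive the
    round trip, because the two low bits are discarded.
    """
    n = int(ipaddress.IPv4Address(etr))
    if n % 4 != 0:
        raise ValueError("ETR address must be a multiple of 4")
    return n >> 2   # 30-bit value

def field_to_etr(field):
    # Restore the full 32-bit address by re-appending two zero bits.
    return str(ipaddress.IPv4Address(field << 2))
```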
This would probably work really well. There is no change to the RIB of the
internal routers. They still work with OSPF or IS-IS and their RIB
contains the same information. It's just that instead of forwarding the
packet based on its destination address, the new kind of packet (with the
Evil Bit set to 1) is forwarded according to the 30 bits in the header
which were previously used for the fragmentation offset and checksum.
There are some restrictions on sending fragmentable packets which are
longer than some limit - see:
http://tools.ietf.org/html/draft-whittle-ivip4-etr-addr-forw-01#section-5
Since the internal network could have different PMTU characteristics from
those assumed when setting the global MinCoreMTU value for the whole of the
Ivip system, perhaps internal micronets could be less restricted in terms
of fragmentable packets. However, overall, I think it is best to
discourage the use of fragmentable packets.
The IPv6 Forwarding approach operates on different principles from the IPv4
approach. With ETR Address Forwarding (EAF) for IPv4 there are 30 bits
available - enough to uniquely identify every ETR.
With Prefix Label Forwarding (PLF) for IPv6, there are only 20 bits
available. The current plan is to use half of these codepoints (524,288)
in the global IPv6 Ivip system to identify the BGP prefix by which the ETR
can be reached. On arrival at the border router which advertises that
prefix, a second operation is required if there is more than one ETR in
that prefix. The mapping needs to be looked up again, for the destination
address of the packet, and the result of the mapping lookup (the ETR's
exact address) may result in the packet being sent to the ETR by one of
several methods:
1 - Forwarded over a direct link to the ETR.
2 - Encapsulated to the ETR (but this raises the PMTUD problems again,
which Forwarding otherwise avoids).
3 - A similar approach to Forwarding across the DFZ, based on the 20 bits
in the header, but this time using the other 524,288 code points, which
internal routers in the ISP network map to prefixes handled by the internal
routing system.
This is more complex than the IPv4 approach, but it is the best we can do
with IPv6's horrendously long 128-bit addresses for ETRs, when we have only
20 bits to play with. While remaining compatible with the existing IPv6
header size, this looks like the only way of avoiding encapsulation and its
PMTUD problems.
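The second-stage choice at the border router might be sketched like this
(every table entry, address and code-point below is invented; the three
delivery methods are the ones listed above):

```python
# Stage 2 at the border router which advertises the matched prefix:
# look up the packet's destination again to find the exact ETR.
# All entries here are invented illustrations.
ETR_FOR_DEST = {
    "2001:db8::42": ("direct", "link-7"),          # method 1
    "2001:db8::43": ("encap", "2001:db8:e::1"),    # method 2
    "2001:db8::44": ("plf", 0x80001),              # method 3: internal label
}

def border_router_deliver(dest):
    method, target = ETR_FOR_DEST[dest]
    if method == "direct":
        return f"forward over {target} to the ETR"
    if method == "encap":
        return f"encapsulate to ETR {target} (PMTUD caveats return)"
    # Internal PLF: rewrite the 20-bit label to an internal code-point.
    return f"rewrite 20-bit label to {target:#x} for internal PLF"
```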
How could this system be used to help with the scaling of internal
networks?
As with the IPv4 approach, the network would have its own internal mapping
system, and its QSDs (and therefore its ITRs) would work from this mapping
information as well as from the global mapping feed.
The above system provides 524,288 micronets which could replace existing
internal prefixes, and be mapped to any internal ETR.
Perhaps these internal micronets could be remapped at lower cost and higher
speed than is possible when they are prefixes in the internal routing
system. If so, this would be a benefit in addition to removing this number
of prefixes from the internal routing system - though they do need to be
covered by some much smaller number of Internal Mapped Address Blocks.
If internal routing systems are currently coping with more than a million
internal routes, and this is somehow not regarded as causing a serious
scaling problem, then it looks like this IPv6 approach to forwarding in
Ivip is not going to make much of a difference, since it could at best
reduce the load by half a million prefixes. But if these prefixes were
frequently changing, and this was a serious burden for the current internal
routing system, perhaps it would be a worthwhile approach.
As with the IPv4 approach, all paths between ITRs and ETRs need to have
routers upgraded to handle the new header format. With a Big network, this
gives rise to a Forwarding Upgraded Zone (FUZv6) - the part or parts of the
network where all routers handle the new Forwarding format of the IPv6
header. These upgrades are more complex than the upgrade for IPv4
forwarding. Still, I guess they might be done with a software update for
existing routers. The IPv6 approach to Forwarding requires some further
management compared to the more direct IPv4 approach.
For the 524,288 interdomain code-points in Prefix Label Forwarding (PLF)
for IPv6, I propose that there be a direct, globally assumed, mapping
between each of these code-points and a particular "Core Egress Prefix",
all of which are in a contiguous block. For instance, they would match:
  CEP-0       E000:0000::/32
  CEP-1       E000:0001::/32
  CEP-2       E000:0002::/32
  . . .
  CEP-524287  E007:FFFF::/32
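The table above is just codepoint arithmetic on the first 32 bits of the
prefix (the E000:0000::/32 base is the example value proposed above; the
function name is mine):

```python
import ipaddress

BASE = 0xE0000000  # first 32 bits of CEP-0 (E000:0000::/32)

def cep_prefix(codepoint):
    """Map a PLF interdomain code-point to its Core Egress Prefix."""
    if not 0 <= codepoint <= 0x7FFFF:
        raise ValueError("interdomain code-points use 19 bits: 0..524287")
    # Add the code-point to the first 32 bits, then shift into place
    # as the high bits of a 128-bit IPv6 address.
    return ipaddress.IPv6Network(((BASE + codepoint) << 96, 32))
```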
For the internal 524,288 code-points, it is up to the network
administrators which code-point matches which prefix, so all the routers
would need to be configured in the same way. There could be some range of
codepoints assigned to one set of contiguous prefixes in one part of the
IPv6 address space, and other ranges of code-points assigned to other
parts.
So both the encapsulation and the two Forwarding approaches to Ivip could
be used to help internal networks scale.
These techniques do not absolutely depend on the development of a global
Ivip system. One or more big ISPs or end-user network operators could, in
principle, talk to their router vendor(s) and ask them to develop ITR, ETR
and mapping systems to work just within their networks. The Forwarding
approaches would require software updates to existing routers, but the ITR
and ETR functions would be simpler than for map-encap due to the lack of
PMTUD problems. Router vendors would tend to implement new features in
their existing devices, but it would also be possible to implement ITR, ETR
and mapping distribution functions in software on ordinary COTS hosts.
I suspect that the scaling problems of internal routing systems are not yet
so pressing as to prompt a development like this, but at least there is
some prospect of synergies between using these techniques internally as
well as for their original purpose - to help solve the interdomain routing
scaling problem.
- Robin
-- to unsubscribe send a message to rrg-request@psg.com with the word
'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg