
[RRG] Scaling: ad-hoc to unified or unified from the start?



Short version:  TRRP starts off ad-hoc and decentralised with
                a pre-existing protocol (DNS) for map enquiries.

                When fully optimised to reduce delays and eliminate
                bottlenecks, it winds up being a hybrid push-pull
                system, with dozens or hundreds of anycast
                nameserver sites - and is still using DNS as its
                query protocol.

                However, TRRP can't do a proper "notify" system
                because its query protocol (DNS) cannot have
                this added.

                By contrast, Ivip starts with the final
                fast hybrid push-pull architecture and
                with a purpose-built query protocol which
                supports fast, reliable, "notify" from the
                local full database query server to every
                ITR which might need the new mapping info.


Hi Bill,

In "Re: Delays inherent in TRRP's DNS-like lookup?" you wrote:

>> Short      How TRRP's DNS system could be souped up to
>> version:   provide much faster response times.
>>
>>            This involves integrating the nameservers and
>>            anycasting them all from multiple sites around the
>>            Net.  The result is a hybrid push-pull query server
>>            network.
>
> Hi Robin,
>
> In a word, yes. For all of the reasons you list in the long
> version of your post, what you describe is a viable way of
> scaling up the middle layers of the TRRP DNS hierarchy to
> handle pretty much any imaginable projection of the routing
> system load. That scale-up is possible from day-one, without
> needing to make any adjustments to the protocol.

It is an advantage for TRRP during deployment to be able to start
with a rather casual, ad-hoc, bunch of nameservers and a few ITRs
and scale up to a tightly integrated set of anycast nameserver sites
- perhaps a few dozen to a few hundred around the world - without
changing the protocol used by the ITRs to get the mapping information.

However, the resulting arrangement is not necessarily as efficient
as if the protocol was optimised for the long-term situation of a
tightly integrated system.

For instance, you start off assuming two query-response cycles to
get the mapping information.  Initially, the first query won't be to
anycast servers, though later it might be.  So to start with, you
have unacceptable delays, and you develop Waypoint Routers to solve
that - though they have bad worst-case and typical delays unless each
WR is implemented with anycast at dozens of sites around the world.

Then you anycast the /8 and the /24 nameservers - presumably at the
same set of sites, as I described in my previous message, with an
example of 36 locations - so you have much lower delays in getting
the mapping.
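
To put illustrative numbers on the difference (the round-trip times
below are assumptions, not measurements), here is a trivial Python
sketch of the delay before the first packet moves, with two
sequential query-response cycles followed by tunnelling to the ETR:

  # Illustrative only: rough first-packet delay for a TRRP-style lookup,
  # comparing distant unicast nameservers with nearby anycast instances.
  # All RTT figures are invented assumptions, not measurements.

  def first_packet_delay(rtt_ns1_ms, rtt_ns2_ms, rtt_etr_ms):
      """Two sequential query/response cycles, then the buffered
      packet is tunnelled to the ETR."""
      return rtt_ns1_ms + rtt_ns2_ms + rtt_etr_ms

  # Distant unicast /8 and /24 nameservers, e.g. the other side of the world.
  print(first_packet_delay(250, 250, 80))   # ~580 ms before the packet moves

  # The same nameservers anycast from a site close to the ITR.
  print(first_packet_delay(20, 20, 80))     # ~120 ms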

If you could farm out more of those anycast nameserver sites into
individual ISP and end-user networks (hybrid push-pull, pushing as
far into the networks as I intend with Ivip) then your delay time
and reliability will generally be so favourable that you won't need
Waypoint Routers at all.

But then you have tens of thousands of anycast nameserver sites, and
you need to push your mapping changes to them all.  That is quite
feasible, in my view, since I can't see how TRRP would support
mobility - and mobility is the only reason I can imagine there would
ever be more than a few hundred million micronets.

However, if you were still using Waypoint Routers, they should be
anycast in the same locations.  Then you don't need the ITR to send
a mapping request and traffic packet.  Just the encapsulated traffic
packet tells the "Default Mapper"-like anycast site what information
the ITR wants.


With TRRP being deployed on a low-key scale and the nameservers (and
perhaps Waypoint Routers) all migrating to the industrial scale,
unified, anycast model, the ITR to DNS and ITR to ETR protocols
would remain the same.

However, changing from the initial completely decentralised model to
the tightly integrated model would require a great deal of
administrative change - and the willingness of all end-users and
RIRs etc. to put all their eggs in the one global TRRP anycast
server farm barrel.  The physical servers would be owned and run by
a consortium - or by ISPs and large end-user networks, if you provided
a feed of updates so they could run their own sites too.

I can imagine some resistance to this from end-users and/or RIRs,
unless it is made clear from the outset that this is part of the
TRRP plan.  In the final anycast plan, whoever was initially
responsible for sections of the address space (RIRs?), and who ran
their own nameservers for their xxx.v4.trrp.arpa domain and for the
equivalent domains for WRs, will no longer run their own
nameservers.  Instead, they will be turning over the job to a
consortium.  Also, the consortium will control the IP addresses of
these nameservers, since they need to fit in neatly as part of the
large-scale anycast system.

So the end-users and RIRs start off by running everything
themselves, but later have to stop this and simply give the
information they would have put in their nameservers to the
consortium.  I am not saying it can't be done, but it is a major
changing of horses midstream.  Any difficulties moving everyone over
to the centralised model would put back the date at which TRRP could
run with the much lower delays which are possible this way.


When people read the Ivip material on RUASes, the Launch servers and
the Replicator system - the most ambitious part of Ivip (since the
ITRs and ETRs are generally easier to implement than with other
proposals) - I hope they will remember that this is the final
system, with everyone operating by the final administrative and
technical arrangements.

While this stuff may look daunting, the Launch system only needs to
be designed and implemented once, and the Replicator software only
needs to be written once.  I don't think these systems are
inordinately complex.  I think the optimal new routing and
addressing architecture will involve fresh protocols, fresh software
and fresh administrative arrangements.

I will be exploring ways of getting the system running in a
relatively lightweight manner.  The initial volume of updates will
be very low compared to the large scale I am engineering the Launch
and Replicator systems for.

However, as I will write more about later, the mapping volumes are
not going to be as astronomical as one might think with 5 billion
cellphones hopping between 3G providers, WiFi hotspots, WiFi in the
home and subway etc. every day.  It is not necessary for the system
to support a mapping change every time a mobile node changes to
another radio network.  The ETR doesn't have to be in the radio
network - Ivip's TTR concept is for an ETR and outgoing packet
handling router outside any particular radio etc. ISP network, with
a 2-way tunnel established by the mobile host from whatever care-of
address it has via each radio link.  That TTR can remain stable as
long as the mobile host is within the same city or region of the
country.  So we won't need to support mapping changes every minute
or hour - only when the mobile node travels many hundreds or a
thousand or more km.
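
A minimal Python sketch of that decision, assuming an invented
1000 km re-homing threshold and invented names (none of this is an
Ivip specification):

  import math

  REHOME_THRESHOLD_KM = 1000

  def distance_km(a, b):
      # crude flat-map approximation: one degree taken as ~111 km
      return math.hypot(a[0] - b[0], a[1] - b[1]) * 111

  class MobileHost:
      def __init__(self, ttr_addr, ttr_location):
          self.ttr_addr = ttr_addr
          self.ttr_location = ttr_location

      def care_of_address_changed(self, new_coa, my_location):
          # Re-establish the 2-way tunnel from the new care-of address;
          # no mapping change is needed for this.
          print(f"tunnel {new_coa} <-> {self.ttr_addr}")
          if distance_km(my_location, self.ttr_location) > REHOME_THRESHOLD_KM:
              # Only now does the end-user issue a single mapping change,
              # pointing the micronet at a TTR in the new region.
              print(f"re-home: pick a TTR near {my_location}, push one mapping change")

  mh = MobileHost("ttr1.example.net", (13.7, 100.5))     # TTR near Bangkok
  mh.care_of_address_changed("10.1.2.3", (13.8, 100.6))  # new WiFi hotspot: no change
  mh.care_of_address_changed("10.9.9.9", (51.5, -0.45))  # arrive near London: re-home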


> There are at least a few other ways to scale the system up
> depending on the balance between cost and performance that
> users are willing to tolerate. Conveniently, these methods
> are all compatible with each other so an AS who wants more
> performance can have it while an AS who would prefer to push
> the cost down instead can have their wish too.
>
> That moves the cost/benefit decision out of the architecture
> level and in to the operations level where it belongs.

This is a good thing.  I intend that this sort of flexibility will
be provided by Ivip.  Each ISP or end-user network makes their own
decisions about where the full database ITRs and query servers are
(perhaps outside their network, for smaller networks), and therefore
where the cheaper caching ITRs are located.  These include the
essentially zero cost ITR functions in sending hosts (not behind
NAT), where an operating system upgrade provides a perfectly good
caching ITR for outgoing packets.
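
Roughly, the ITR function in a sending host's OS only has to do
something like the following Python sketch - the query-server
interface and record layout are invented for illustration, not an
Ivip specification:

  import ipaddress

  mapping_cache = {}   # micronet (ip_network) -> ETR address, or None if unmapped

  def query_local_query_server(dst):
      # Stand-in for a query to the local QSC/QSD; here it just pretends
      # the /28 around dst is a micronet mapped to one ETR address.
      micronet = ipaddress.ip_network(f"{dst}/28", strict=False)
      return micronet, ipaddress.ip_address("192.0.2.1")

  def handle_outgoing_packet(dst_str, payload):
      dst = ipaddress.ip_address(dst_str)
      for micronet, etr in mapping_cache.items():
          if dst in micronet:
              break                                 # cache hit
      else:
          micronet, etr = query_local_query_server(dst)
          mapping_cache[micronet] = etr
      if etr is None:
          return ("raw", dst_str, payload)          # not mapped: send as-is
      return ("encap", str(etr), dst_str, payload)  # tunnel to the ETR

  print(handle_outgoing_packet("203.0.113.77", b"hello"))
  print(handle_outgoing_packet("203.0.113.78", b"again"))   # served from the cache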


> Here's some good news: it's unlikely that such a scale-up would
> ever be necessary. The APT researchers report a miss rate under
> 1% for as little as 4000 cached entries.

I don't see how such simple figures could be extrapolated to
something as complex as a map-encap scheme, over all time, for all
networks, for all types of traffic.

Each host running a P2P system will be firing packets to a
constantly changing global menagerie of IP addresses its ITR has
never heard of.  P2P is really popular and won't go away.
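
As a toy illustration of why I am wary of extrapolating (the traffic
model below is entirely invented), a 4000-entry LRU cache behaves
very differently for a small working set than for P2P-like traffic
which keeps contacting previously unseen addresses:

  import random
  from collections import OrderedDict

  def miss_rate(cache_size, destinations, packets):
      cache = OrderedDict()
      misses = 0
      for _ in range(packets):
          dst = random.choice(destinations)
          if dst in cache:
              cache.move_to_end(dst)               # refresh recently used entry
          else:
              misses += 1
              cache[dst] = True
              if len(cache) > cache_size:
                  cache.popitem(last=False)        # evict least recently used
      return misses / packets

  random.seed(1)
  small_set = [f"10.0.{i // 256}.{i % 256}" for i in range(3000)]       # fits in cache
  p2p_like = [f"10.{i // 65536}.{i // 256 % 256}.{i % 256}" for i in range(300000)]

  print(miss_rate(4000, small_set, 100000))   # a few percent, after warm-up
  print(miss_rate(4000, p2p_like, 100000))    # almost every lookup misses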

I think a number of map-encap schemes could be attractive enough to
end-users who are too small to get conventional PI space - so they
could be introduced with some success.  The question is which one of
these would produce the best long-term benefits:  Probably the one
which is most attractive to all kinds of end-users, large and small.

Probably somewhat less than ideal performance will be fine for
people who don't have an alternative way of getting multihomable,
portable, address space.

However I think it would be best to have a system which provided
address space with so few downsides, and so many benefits, that it
would be enthusiastically adopted by big end-users with current
BGP-managed PI space.  Ideally, they would want to convert each PI
prefix they have now to be managed by the map-encap scheme, making
each one a single MAB, and therefore a single BGP advertisement -
rather than having a bunch of more specifics in that prefix, as they
probably do now.

So the fewer delayed packets there are, the more we can attract big
incumbent end-users and the more positive impact we will have on the
routing scalability problem.


> Having worked with caching systems before, this is not out of the
> ballpark. TRRP gains the same kind of benefit, but deployed in
> COTS hardware with gigabytes of memory it can maintain a
> substantially larger cache.

I hadn't heard "COTS" before - "Commercial Off-The-Shelf".

I think Ivip could be done quite nicely for the first few years with
ordinary "COTS" servers and suitable, presumably FOSS (Free Open
Source Software) software.

Over time, the big-iron routers would be upgraded with software to
support the caching ITR and perhaps full database ITR functions -
and the ETR functions.  They would handle packets faster than any
server-based software router, but smaller sites could still use
server-based software ITRs.

Until then all the ITR and ETR work could be done with servers.  The
Launch and Replicator servers are plain Intel-compatible servers
too, probably running in 64 bit mode with dual or quad CPUs, which
are becoming mass-market consumer items.

The servers need to be nice and reliable, with easy OS updates for
security patches -  but the good thing is that if one goes down, you
can grab another one and plug it in the failed unit's place.  Having
a shelf full of spare routers and router bits would be much more
expensive.  The server OS can be pretty stripped down, since it is
not running conventional applications or supporting hardware beyond
Ethernet, hard drives and probably some Flash memory.


> Even if caching failed us for some unforeseeable reason, we could
> compose a v4.trrp.arpa consisting of the top 50k /24's and
> wildcards to the /8 level NS servers for the rest.

I don't clearly understand all this, but I think I get the drift.

> Such a cache-helper zone could be pushed relatively cheaply to
> as many ASes as wanted it without disturbing the protocol.

Are you suggesting that these ASes have their own anycast
nameservers for v4.trrp.arpa and its subdomains?  Alternatively, are
you suggesting that the addresses of the authoritative nameservers
(which are still scattered round the Net) are pushed to caching
nameservers in these ASes?


> Here's some better news: the churn rate for these middle layers is
> low enough that we could get away with pushing an update as rarely
> as once a day. Once per hour is perhaps more likely for the sake
> of convenience.

OK.

> Even if the operational realities incent us to push updates to
> geographically dispersed servers, they'll be really cheap.

I agree.


> Remember, multihoming, traffic engineering and mobility happens at
> the bottom layer of the TRRP hierarchy. The middle and top layers
> only need to support major reconfiguration.

Yes.


> But even if we did end up having to push the data in the way you
> describe, there is still no single device that has to handle move
> data than COTS hardware is capable of. There are no $10M routers
> or map servers required in TRRP's architecture no matter how far
> it scales.

Yes, all the server stuff can be done with industrial, reliable,
rack-mount, dual power supply versions of the extraordinary
technology which is now bog-standard in home PCs.


>> The /8 and the /24 servers in the same racks - so why not
>> modify the /8 ones, and make them talk to the relevant /24 one,
>> to return the complete answer with the response to the first
>> query?
>
> I suspect for the same reason the GTLD servers aren't modified to
> query the second-level servers on behalf of requesters. But if
> performance is sufficiently enhanced by such an activity it could
> be done without changing the protocol.

The only way I can think of to speed up mapping responses is to
anycast the authoritative nameservers for both queries so they are as
close to ITRs as possible.  Then, as I suggested, a redesign of the
server system might be easier than keeping it as conceptually separate
nameservers.  It would still behave like a nameserver, but with more
direct lookup of the complete result from a single database than by
one nameserver querying another.
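
A Python sketch of that "one database, one answer" idea - the anycast
site holds the whole mapping table and returns the final record in a
single exchange, instead of a /8 referral followed by a /24 query.
The table contents and record shape are invented for illustration:

  import ipaddress

  MAPPING_TABLE = {
      ipaddress.ip_network("198.51.100.16/29"): "192.0.2.1",   # micronet -> ETR
      ipaddress.ip_network("198.51.100.24/31"): "192.0.2.7",
  }

  def one_shot_lookup(dst_str):
      # Longest-match lookup over the whole table - one query, one reply.
      dst = ipaddress.ip_address(dst_str)
      best = None
      for micronet, etr in MAPPING_TABLE.items():
          if dst in micronet and (best is None or micronet.prefixlen > best[0].prefixlen):
              best = (micronet, etr)
      return best    # None means "not mapped"

  print(one_shot_lookup("198.51.100.20"))   # complete answer in a single exchange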


>>  Around this time, you will probably be thinking . . . if we
>>  could only speed up the rsync system, we could enable the
>>  end-users to have fast enough control of the ITRs to provide
>>  Mobility.
>
> That's already tackled in a different direction with preemptive
> change notification (PCN).
>
>  http://bill.herrin.us/network/trrp-preempt.html

OK - so TRRP starts conservatively with a pure pull DNS-based global
query system.

The ITR has to make two requests to get the mapping information for
a single destination IP address, and can optionally make further
requests to find out if this is part of a micronet - to save it
making IP-address specific requests in the event it handles packets
to adjacent addresses which are in the same micronet.

Since that is slow, it has a separate part of its DNS devoted to
enabling the ITR to find an optional Waypoint Router, which will get
the initial packets to the ETR faster than would be possible if the
ITR waited for the mapping reply.
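
Something like the following Python sketch, where the per-/8 WR
naming and the handling are my own invented illustration rather than
TRRP's actual design:

  mapping_cache = {}       # destination address -> ETR (exact addresses, for brevity)
  pending_queries = set()

  def waypoint_for(dst):
      # hypothetical: one Waypoint Router per /8, named after the first octet
      return "wr-" + dst.split(".")[0] + ".example.net"

  def forward(dst, packet, send_query, tunnel):
      if dst in mapping_cache:
          tunnel(mapping_cache[dst], packet)       # normal case: straight to the ETR
          return
      if dst not in pending_queries:
          pending_queries.add(dst)
          send_query(dst)                          # start the DNS lookups
      tunnel(waypoint_for(dst), packet)            # meanwhile, send via the WR

  def mapping_reply(dst, etr):
      pending_queries.discard(dst)
      mapping_cache[dst] = etr                     # later packets go direct

  tun = lambda next_hop, pkt: print("encapsulate to", next_hop)
  forward("203.0.113.9", b"pkt1", send_query=print, tunnel=tun)   # goes via the WR
  mapping_reply("203.0.113.9", "192.0.2.1")
  forward("203.0.113.9", b"pkt2", send_query=print, tunnel=tun)   # goes direct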

Since the pure pull nature of the mapping system is rather slow at
getting user changes of mapping data to all the ITRs in the world,
the ITRs need to do the same as with LISP and APT - make their own
decisions about multihoming restoration etc.  This means the mapping
information is complex (as with LISP and APT), since it must list
multiple ETR addresses, and give some priorities.  (Also, with
further instructions in the mapping data, LISP, APT and TRRP ITRs
can do explicit load sharing over multiple ETRs for the one
micronet, which Ivip doesn't do.)

This has the serious cost that the LISP, APT and TRRP systems
monolithically integrate these multihoming failure detection and
decision-making functions.  Ivip separates them completely - the
end-user does their own reachability and multihoming restoration
decision making, in whichever way they choose.

Since TRRP ITRs can't be relied upon to do what the end-user
specifies in real time (due to TRRP's pull and cache mapping
distribution), you add a third element to the system: Preemptive
Change Notification.  This is intended to enable ITRs to make the
sort of decisions end-users would want regarding multihoming service
restoration.

PCN type 1 attempts to find the original requester and invalidate
its cached entry, prompting it to re-request the mapping if it is
still interested.  But PCN 1 can't work reliably as an addition to
the DNS infrastructure, since it can't always reach the original
requester.

PCN 2 relies on a traffic packet being sent by an ITR to an ETR
which has decided it can no longer reach the destination network.
It fires back something to the ITR to tell it that this is not the
place to send traffic packets any more.  Then, after some checking,
the ITR is supposed to use an alternative ETR.  But this has
problems including:

  1 - What if the reason for not using the ETR is that the ETR
      is unreachable?  The ITR may not know about this quickly
      unless it did a costly set of probes on a continual basis.
      Those probes would also be a serious burden for the ETR.

  2 - What if the ETR is later able to reach the destination
      network again?  The ITR wouldn't try to send packets to
      it again until it refreshed its mapping info.

  3 - While PCN 1 would force the ITR to get the latest mapping
      information, as defined by the end-user, PCN 2 is just a
      form of reachability testing which is at best helpful only
      with multihoming restoration.

By contrast, the notification from Ivip full database query servers
(QSDs) to caching ITRs (ITRCs) via any intermediate caching query
servers (QSCs) will be local (compared to global, or anycast global
for the partially effective PCN 1) and will be a central function of
a purpose-designed protocol.  So it can be secured easily with the
nonce of the original request, and by expecting an acknowledgement
UDP packet from the ITRC or QSC.  Local notification also means good
load sharing, and it is fast and cheap.
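
A minimal Python sketch of that nonce-plus-acknowledgement exchange,
with an invented message layout (this is not an Ivip wire format):

  import secrets

  class QSD:
      """Full database query server: remembers who asked about what."""
      def __init__(self):
          self.outstanding = {}    # (itrc, micronet) -> nonce from the original query

      def answer_query(self, itrc, micronet, etr):
          nonce = secrets.token_hex(8)
          self.outstanding[(itrc, micronet)] = nonce
          return {"type": "map-reply", "micronet": micronet, "etr": etr, "nonce": nonce}

      def notify(self, itrc, micronet, new_etr):
          # Sent (and resent until acknowledged) only to ITRCs and QSCs which
          # asked about this micronet within the caching time.
          nonce = self.outstanding[(itrc, micronet)]
          return {"type": "notify", "micronet": micronet, "etr": new_etr, "nonce": nonce}

  class ITRC:
      """Caching ITR: accepts a notify only if the nonce matches its query."""
      def __init__(self):
          self.cache = {}          # micronet -> (etr, nonce)

      def on_map_reply(self, msg):
          self.cache[msg["micronet"]] = (msg["etr"], msg["nonce"])

      def on_notify(self, msg):
          etr, nonce = self.cache.get(msg["micronet"], (None, None))
          if nonce != msg["nonce"]:
              return None                          # not ours / possible spoof: ignore
          self.cache[msg["micronet"]] = (msg["etr"], nonce)
          return {"type": "notify-ack", "micronet": msg["micronet"], "nonce": nonce}

  qsd, itrc = QSD(), ITRC()
  itrc.on_map_reply(qsd.answer_query("itrc-1", "198.51.100.16/29", "192.0.2.1"))
  print(itrc.on_notify(qsd.notify("itrc-1", "198.51.100.16/29", "192.0.2.7")))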

So TRRP will resemble:

  An initially ad-hoc DNS based system later completely compressed
  into unified global anycast query servers.

  Waypoint Routers likewise souped up into anycast systems for low
  path stretch and good load sharing.  (Except they are not really
  needed with the unified anycast sites which form a global,
  relatively local, query server system.)

  PCN 1 won't be used because it is only partially effective at
  reaching the original requester.

  PCN 2 only diverts traffic from ETRs which are connected to the
  Net, but not to the end-user's network - so it only helps in some
  multihoming situations.

  If you could figure out a better notify system, then it would
  probably be on a global basis - or from the anycast server sites.
  Then you would have created a "push to the anycast server site"
  system complemented by a push system from each such site to every
  ITR which needs to know the mapping has changed.  Two layers of
  push would be good for performance and control - Ivip uses
  push to the full database ITRs and Query Servers, and then
  "notify" (push to the specific ITRs which might need to know)
  of mapping changes which occur inside a caching time.

  But TRRP's ITRs rely on DNS, which is not amenable to having this
  kind of notify system added to it - whereas Ivip's fresh protocols
  will be designed specifically to support this.

  Like LISP and APT, TRRP still can't do mobility, and is a
  monolithic system which doesn't give end-users real time control
  over the ITRs.

The next thing is you will want to do mobility with it, which means
changing the whole conception of the system - since you need much
better real-time control of the ITRs than is possible with pull and
cache.

I figure if you board a plane in Bangkok and get out at Heathrow,
you want to be able to find a UK-based TTR in a few seconds and get
the global ITR system to tunnel packets to that TTR in another 5
or 10 seconds.  Ten minutes or an hour is not good enough.

You can't do that with a pull-based system unless you made caching
times so short that the query servers (nameservers for TRRP) would
be continually busy answering the same queries.  You can't retrofit
a good notify system to the DNS approach to querying, in part
because there could be intermediate caching nameservers which you
can't upgrade.  Also, I don't think a global notify system scales
well.  I think notify from Ivip's tens or hundreds of thousands of
full database ITRD/QSD sites will be reliable, efficient and scale
nicely.

> Two key tasks for a mobility system are:
>
> 1. Manage disconnection so that the systems with which a mobile
>    device is communicating don't quit communication while waiting
>    for the device to become available again.

Not all applications would be amenable to this, since it resembles a
long period of 100% packet loss.  However, with mobility keeping the
same IP address for a long time, at least the mobile OS could be
configured to keep that address alive as far as the mobile
applications were concerned, even if all links to IP networks were dead.

> 2. Quickly push knowledge that a change in location has occurred
>    out to the systems with which the mobile device is
>    communicating.

That would involve changing applications and operating systems in
the corresponding hosts - which could be any host in the world.

I think this is not the way to go.

Ivip's approach to mobility doesn't require any changes to the OS or
applications in the correspondent hosts.

> TRRP partially addresses #1. My best guess is that #1 would
> require either a significant change in the IP stack or some landed
> station would need to take over for the mobile station upon
> detecting disconnection so that it could close the TCP windows and
> do similar tasks which cause the servers communicating with the
> mobile station to pause their transmissions without disconnecting.

Yes - with Ivip, the TTR is probably the place to do this, or at
least have the TTR tunnel packets to some server somewhere which is
aware of what has been transpiring on the mobile host so that
whatever can be kept alive will be.  That "continuity server" could
get its state from the TTR, which regularly compares notes with the
mobile OS and/or tries to deduce what applications are active on the
mobile host, by analysing the traffic.  (This would be tricky and
only partially effective, at best, and more likely generally
impossible.)

> TRRP makes the latter process relatively straightforward: simply
> change the ETR entry for that mobile device.

You would have to do this with a priority list, with the "continuity
server" given the lowest priority - so that if all the ETRs in the
mobile networks the MH might connect to are unreachable (or each ETR
somehow signals to the ITR that it can't reach the MH - but how, and
how would it later tell the ITR it can reach the MH again?), the
packets are tunnelled to the continuity server.  But with TRRP, LISP
or APT, each ITR is a lonesome soldier, with a list of marching
orders retrieved some time ago from central command.  Each ITR needs
to figure out how to conduct itself, by trying to discern which ETRs
are reachable and which of them can reach the destination network.

Ivip is completely different.  The ITRs map each micronet to a
single ETR address, with the end-user controlling them all in
real-time - ideally within 5 seconds.  This removes a great deal of
complexity from the ITR and ETR system and makes the system much
more useful for multihoming restoration and mobility.
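
To make the contrast concrete, here are the two kinds of mapping
record side by side, with invented field names:

  # LISP/APT/TRRP style (as described above): the mapping carries several
  # ETRs with priorities, and each ITR decides for itself which to use.
  multi_etr_mapping = {
      "micronet": "198.51.100.16/29",
      "etrs": [
          {"addr": "192.0.2.1", "priority": 1},
          {"addr": "192.0.2.7", "priority": 2},   # used if the ITR decides #1 failed
      ],
  }

  # Ivip style: one ETR per micronet; restoration is simply a new mapping
  # pushed in real time by the end-user or a monitoring service they appoint.
  ivip_mapping = {"micronet": "198.51.100.16/29", "etr": "192.0.2.1"}

  def end_user_restores_service(mapping, new_etr):
      mapping["etr"] = new_etr     # one change, reaching all ITRs via push + notify
      return mapping

  print(end_user_restores_service(ivip_mapping, "192.0.2.7"))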


> TRRP directly addresses #2 with PCN. Because of the authentication
> complexity, PCN doesn't push the change out to the ITRs using it.
> Instead, it pushes only the knowledge that the cached entry is
> stale and must be updated with a new query.

PCN 1 is cache invalidation and can't work properly.  PCN 2 is not
cache invalidation at all.  So I don't see either as a way of
ensuring all relevant ITRs get new mapping information.  If you
could solve that problem, TRRP would be a lot more powerful.  You
would have created some kind of pull and cache system (perhaps with
dozens or thousands of somewhat local full database query server
sites), and you would have a reliable way of sending notifications
only to those ITRs who might need it.

(But how to ensure, from an anycast site, that the notification
packet, presumably UDP, really got through to the intended ITR?  You
couldn't expect to use TCP, or a UDP acknowledgement packet, since
anycast means the ITR's acknowledgement packet could be sent to
another anycast site.)

If you could do this, quickly, and ensure that end-user mapping
change commands were also pushed quickly to all your anycast sites,
then you would have achieved what Ivip intends to achieve - give the
end-users real-time control over all the world's ITRs.

Then, you wouldn't need multiple ETR addresses in each mapping
update, since at any time the end-user could decide how to restore
the service in a multihoming situation, and tell all the world's
ITRs which ETR to use instead.


Sorry if this theme becomes a little tiresome: "If we upgrade scheme
X, and then upgrade it some more, it winds up looking a lot like
Ivip or some comparable fast hybrid push-pull scheme with notify".
That is how I see things, but others may have different perspectives.


>>  > The worst case of this problem is precisely the dot-com zone
>>  > problem which is already solved and deployed at an operations
>>  > level.
>>
>>  Do you have any references on how they do this?  I couldn't
>>  easily find technical information on what software they use for
>>  the nameservers, or how the .com database is maintained and
>>  somehow made accessible to multiple registrars.
>
> Dot-com and dot-net are generated out of a central SQL database
> maintained by Network Solutions if I remember right. The various
> registries have access to an API that lets them add and remove
> information. An hourly dump is done into a zone file which is
> distributed to the contractors running the gtld servers, some
> anycast some not.
>
> The details are probably buried in NSI's contract with ICANN
> somewhere.

OK - thanks for this explanation.


>>  Ivip needs more protocol and software development and an
>>  up-front deployment of servers by some sort of consortium to get
>>  started.  I like to think this can be done due to the enthusiasm
>>  generated by a system which promises to fix the routing
>>  scalability problem, provide many more end-users with
>>  multihoming, portability etc.
>
> I'd like to think so too, but the pragmatist in me won't allow it.
> We'd have safe, efficient autodrive cars if only we had a
> passenger car-sized rail system for them to run on. It's not that
> we couldn't build such rails or even that they'd be more expensive
> to build and maintain than asphalt roads. It's that we have a
> massive deployed infrastructure of asphalt roads and no deployed
> infrastructure of passenger car-sized rails.
>
> Ideas can be revolutionary but infrastructure construction must by
> necessity be evolutionary. It's like the way to Millinocket, "You
> can't get there from here."

My task is to show that Ivip (or any similar proposal) would get
everyone somewhere significantly better than the current
alternatives - LISP, TRRP or APT - would.  I also need to show that
Ivip's development and deployment will be lightweight enough that
money can be made and/or saved in a short timeframe.

I am finding these discussions about TRRP and APT very helpful.
Thanks for responding in detail.

  - Robin

