
[RRG] Re: Delays inherent in TRRP's DNS-like lookup?



Short version:  How TRRP's DNS system could be souped up to
                provide much faster response times.

                This involves integrating the nameservers and
                anycasting them all from multiple sites around the
                Net.  The result is a hybrid push-pull query server
                network.


Hi Bill,

You wrote:

>> So for IPv4, you think the RIR's server would do this task of 
>> handling up to 2^16 subdomains.  That sounds technically
>> feasible, but would the RIR really want to do this?
> 
> That is how things are constructed for the in-addr.arpa domain.
> Most allocations are longer than /16, so the corresponding
> delegations happen as /24's. It works in practice, today.
> 
> The scaleups necessary for the roots and tlds would probably be 
> necessary for the /8 level of the TRRP hierarchy, but those are
> also well tested.

OK, based on other things you wrote, I understand that each /8
subdomain xxx.v4.trrp.arpa will be served by a bunch of anycast
nameservers, each holding the delegations for up to about 64k /24
subdomains.

This means a typical IPv4 TRRP mapping lookup involves a query to
one of the /8 nameservers followed by a query to the nameserver
(presumably not anycast) which is authoritative for the /24 in which
the destination IP address lies.
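
As a very rough sketch of what this costs an ITR on a cache miss -
assuming, purely for illustration, that /24 zones are named
in-addr.arpa style (deeper octets prepended to the /8 zone), that
the mapping lives in some record type I'll write as "TXT", and that
dns_query() stands in for whichever resolver library the ITR uses -
the two round trips look something like this:

  # Two round trips per cache miss: one to a /8 zone's (anycast)
  # server for the delegation, one to the /24 zone's server for
  # the mapping.  Zone naming, record type and dns_query() are my
  # assumptions, not anything TRRP specifies.

  def trrp_zones(dst_ip):
      o1, o2, o3, _o4 = dst_ip.split(".")
      zone8 = "%s.v4.trrp.arpa" % o1                    # the /8 zone
      zone24 = "%s.%s.%s.v4.trrp.arpa" % (o3, o2, o1)   # assumed name
      return zone8, zone24

  def lookup_mapping(dst_ip, dns_query, slash8_server):
      # slash8_server: address of a nameserver for the /8 zone,
      # which the ITR would already have cached.
      _zone8, zone24 = trrp_zones(dst_ip)
      ns_for_24 = dns_query(zone24, "NS", server=slash8_server)
      return dns_query(zone24, "TXT", server=ns_for_24)

Under that assumed ordering, trrp_zones("123.45.67.89") gives
("123.v4.trrp.arpa", "67.45.123.v4.trrp.arpa").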

Anycasting these /8 nameservers, say to 18 different locations each
(my guess for the purpose of discussion), starts from a first
lookup load which is already split over about 220 conceptually
different nameservers.  The split is uneven - the load on each
depends on how much traffic is sent to addresses in its /8 - so
this is not the same as evenly spreading the whole load over 220
nameservers.

(220 is a guesstimate of the /8s 0 to 223, minus 10, 127 and
probably a few others which won't be used for public unicast space.)

Then, assuming the anycasting is done in a consistent, efficient,
regularised way, you have 18 sites around the world, each of which
anycasts all 220 nameservers.  This spreads the load further,
reducing the load per nameserver by very roughly a factor of 18,
depending on where the 18 sites are located.
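
To put very rough numbers on that (the global query rate below is
plucked out of the air purely for the arithmetic, and real load
would be far from uniform across the /8s):

  # Back-of-envelope load spreading; every input is a guess.
  global_first_queries_per_sec = 1_000_000   # hypothetical total
  slash8_zones = 220                         # the guesstimate above
  anycast_sites = 18                         # per /8 nameserver

  per_zone = global_first_queries_per_sec / slash8_zones   # ~4,500
  per_zone_per_site = per_zone / anycast_sites             # ~250
  per_site_all_zones = global_first_queries_per_sec / anycast_sites
  # per_site_all_zones is roughly 56,000 queries a second.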

Take, for instance, the nameserver (actually you would have at
least two nameservers, on two addresses, in at least two different
locations) which is authoritative for the prefix 123.0.0.0/8 - that
is, the one authoritative for the domain:

  123.v4.trrp.arpa

It would exist at various locations and on various addresses.  The
diversity is in three dimensions.

Dimension 1: One of two IP addresses.

The system starts with the domain v4.trrp.arpa specifying two IP
addresses for two physically different and geographically separate
nameservers:

   11.22.33.44   For instance in New York.
   55.66.77.88   For instance in Tokyo.


Dimension 2: 18 anycast sites for each of these.

Then, assuming you want to anycast these to 18 sites, you need a
total of 36 sites around the world.

There's not much point in having a single site which anycasts both
addresses, because if that site died the whole catchment area for
that anycast site would have no nameserver for this domain.

So you pick 18 sites scattered around the world for each IP address.
 For instance:

   11.22.33.44    Sydney, Bangkok, Tokyo, Beijing, Kolkata,
                  Saint Petersburg, Istanbul, Cairo, Warsaw,
                  Berlin, Stockholm, Edinburgh, Milan,
                  Sao Paulo, Montreal, Chicago, San Francisco,
                  Hawaii.



   55.66.77.88    Melbourne, Singapore, Seoul, Hong Kong,
                  Mumbai, Moscow, Jerusalem, Cape Town, Athens,
                  Rome, Paris, London, Madrid, Buenos Aires,
                  New York, Dallas, Los Angeles, Auckland.


Dimension 3: The load is spread over the 220 /8 servers, rather
than having every query go to a single server which is
authoritative for the whole of v4.trrp.arpa.

So while this is a global query server system, the average path
length to the nearest anycast nameserver will be some smallish
fraction of the radius of the Earth.

This is definitely faster than a global query server system with a
single server "somewhere", where the path length is anywhere from 0
to half the circumference of the Earth (or worse, since real paths
rarely follow a great circle to the server).  Even that
single-server approach, over ordinary Internet links, would in turn
be faster than reaching a single server via a circuitous set of
hops between routers, as with ALT, which makes the total path
length typically much longer still.
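
The raw propagation numbers, taking the speed of light in fibre as
roughly 200,000 km/s and ignoring the extra distance and router
hops of real paths:

  # One-way propagation delay over an ideal great-circle fibre path.
  C_FIBRE_KM_PER_SEC = 200_000.0
  HALF_CIRCUMFERENCE_KM = 20_000.0

  def one_way_ms(path_km):
      return 1000.0 * path_km / C_FIBRE_KM_PER_SEC

  print(one_way_ms(HALF_CIRCUMFERENCE_KM))  # ~100 ms: far side of Earth
  print(one_way_ms(2_000))                  # ~10 ms: nearby anycast site

Double these for the query-plus-response round trip, which is where
the 100 to 200 msec worst-case figures later in this message come
from.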

So far, all this has done is increase the robustness and decrease
the time delay for the first query.  So far, so good.

In order to anycast the servers like this, you are probably going
to form a consortium to do the whole thing in an orderly fashion.
Alternatively, you could have one consortium for each set of
servers (11.22.33.44 being the A set, and 55.66.77.88 the B set):
Consortium A sets up 18 sites and Consortium B sets up the other
18, in generally different locations.

This pooling of resources makes the anycast system less expensive
than if the RIR which runs 111.0.0.0/8 and a bunch of other /8s had
to build its own set of anycast sites, while the other RIRs etc.
did the same independently.



The second query is probably going to be to a single server
somewhere, or to one server chosen at random from a short list, with
them being geographically dispersed for robustness.  The path to the
chosen server is typically between 0 and about half the
circumference of the Earth.

Operators of /24s which carry a lot of traffic could choose to
anycast their .trrp.arpa nameservers, so that on average there is
only a relatively short path to the nearest one - but that could
turn into an awful lot of nameservers.  Unless the anycast servers
run by these many different operators were consolidated into some
smaller set, they would chew up quite a lot of /24s (each anycast
address needs its own advertised prefix), as well as being
expensive to run, since they are so numerous.


Maybe there is some snappy way you could have both kinds of
nameserver anycast, with relatively few physical servers to handle
the entire job.  Then, to the extent that you anycast them both, you
are both spreading the load and reducing typical paths for the two
queries and the two responses.

This turns what began as a global query server system into a bunch
of servers which are more local to each ITR.

One advantage of this is that you can start small, with a pair of
ordinary nameservers, and - without changing the protocol or the
ITRs - scale the system up to handle very large numbers of queries,
with shorter response times and greater reliability.


With a hundred or so sites around the Net, the great majority of
ITRs would be geographically close to one - within a few percent of
an Earth radius - with the more remote ones still being close
enough that the delay is greatly reduced.

But then, you have just created a hybrid push-pull system of sorts,
since you need to push mapping information out to each of the 100
anycast server sites!

If you are doing this slowly, you could probably use rsync.
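
Something as simple as the following, run every few minutes at each
anycast site, would do for a slow push.  The host, module and
directory names are invented for illustration:

  # Pull the latest mapping data from a (hypothetical) master server.
  import subprocess

  subprocess.run(
      ["rsync", "-az", "--delete",
       "rsync://mapping-master.example.net/trrp-zones/",
       "/var/lib/trrp/zones/"],
      check=True,
  )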


If you could maintain the pattern of ITRs caching the addresses of
the servers authoritative for the 220 or so /8s, and have each ITR
do a query and get a response, then make a second query and get its
response, then - provided you weren't troubled by packet losses and
the servers always responded - you would have a pretty fast,
generally "local" query server system.
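
Building on the earlier lookup_mapping() sketch, that caching
pattern is little more than a dictionary in front of the first
query (TTL handling and retries omitted, dns_query() still a
stand-in):

  # Cache the address of a nameserver for each /8 zone, so that
  # only the first ever lookup into a given /8 needs to consult
  # v4.trrp.arpa itself.  At most ~220 entries.
  slash8_ns_cache = {}   # first octet -> /8 zone nameserver address

  def slash8_server_for(dst_ip, dns_query):
      o1 = dst_ip.split(".")[0]
      if o1 not in slash8_ns_cache:
          slash8_ns_cache[o1] = dns_query("%s.v4.trrp.arpa" % o1, "NS")
      return slash8_ns_cache[o1]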

But what if some large ISP wanted its own set of anycast servers?
They may prefer to get an rsync feed and run their own anycast
servers, to improve the response time of their ITRs (and so gain a
marketing advantage).  For instance, some USA-based ISP wants to put
a pair of anycast sites, just like those described above, in Houston
and Atlanta.   They are happy to pay for the servers - so this
reduces load on the main system.

Pretty soon, there could be many more anycast sites, especially as
each one takes less load than the 36 big sites, and so can probably
be implemented with a smaller number of servers.  Next, the
USA-based ISP decides to put a pair of nameservers in each of their
data centres - one for each of the two sets of IP addresses used by
the pairs of servers for the xxx.v4.trrp.arpa domains.  They use
their original rsync feed and fan this data out to their multiple
sites.

Now you can't very well alter the IP addresses of the nameservers,
because multiple ISPs such as this would face the disruptive task
of changing their many sites to new addresses.  So ideally, before
things got to this stage, you would want to choose IP addresses for
the various xxx.v4.trrp.arpa nameservers so that they all fitted
nicely into a single /24, which could be anycast with less fuss
than a bunch of /24s.  This is further pressure for the initial
deployment of nameservers to be done by a consortium, rather than
completely independently by multiple RIRs etc.

The RIRs or that consortium would need to charge the end-users
whose address space is in these /8s, and since some of that address
space will generate very few queries and some a great many, the
billing for running these anycast nameservers would have to be
based on query traffic.  A low-key outfit with a lightly used /16
wouldn't want to pay a high flat fee which subsidises another
end-user, such as Google, whose /16 generates ten thousand times
more queries to the anycast nameservers than their own.
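
As a toy illustration of the difference (every figure below is
invented):

  # Query-traffic-based billing; all numbers are made up.
  FEE_PER_MILLION_QUERIES = 0.50    # hypothetical currency units

  def monthly_charge(queries_per_month):
      return FEE_PER_MILLION_QUERIES * queries_per_month / 1e6

  print(monthly_charge(2_000_000))        # lightly used /16: 1 unit
  print(monthly_charge(20_000_000_000))   # 10,000x busier: 10,000 units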

In order to make this system run efficiently, in terms of scaling,
low delays etc. I think you would face many pressures to tightly
integrate the whole thing, from the outset, so it could be scaled
up, anycast, farmed out to ISPs who want to run their own anycast
servers etc.

In principle, I think you could make a global, distributed,
nameserver system for the /8 part of the query, with generally low
delay times and high query capacity.

The /24 query is a different matter.  If the nameservers for these
were run by end-users, then there would be a gaggle of them, and
only the keenest end-user would anycast their nameservers - to
spread load, provide robustness and to reduce response times.

One way to optimise the /24 response time is to unify the operation
of the nameservers for these as well.  This would require some kind
of consortium.  The most obvious first step is to run the /24
nameservers in the same rack as the /8 nameservers - and then
anycast the whole thing to the 36 locations mentioned above, and to
more as time goes on, including letting ISPs duplicate the whole
thing at their own expense wherever they like.

Now the operation of this scheme becomes technically and
administratively monolithic.  But this is clearly the way forward in
terms of improving robustness and reducing delay.

The /8 and the /24 servers are now in the same racks - so why not
modify the /8 ones to talk to the relevant /24 one, and return the
complete answer in the response to the first query?

Then you only have one query and one response - at least halving the
total delay time in getting mapping.

It would probably be easier to rewrite the server to do this job
specifically, so that while the software would still present
visible nameservers for the various levels of sub-domain, including
the /24 level, it could answer the whole query at the /8 level.
There would need to be a large, unified, local database which the
server queries.  That could be on a separate machine in the rack,
or you could integrate the whole thing into a single piece of
software on a single server.  Then almost anyone could run one of
these things, as long as they had an rsync feed.
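
A sketch of what that collapsed server boils down to - ignoring the
DNS framing entirely, and treating the rsync-fed database as a
dictionary keyed by /24 prefix, which is my simplification rather
than anything TRRP specifies:

  # One local database, one query, one answer - no referral step.
  # The database layout and the example addresses are invented.
  import ipaddress

  mapping_db = {
      # /24 prefix -> ETR address(es); kept fresh by the rsync feed.
      ipaddress.ip_network("123.45.67.0/24"): ["198.51.100.7"],
  }

  def answer(dst_ip):
      net = ipaddress.ip_network(dst_ip + "/24", strict=False)
      return mapping_db.get(net)   # None means "no mapping known"

  print(answer("123.45.67.89"))    # -> ['198.51.100.7']

The interesting part is not the lookup, of course, but keeping
mapping_db identical and fresh at every anycast site.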

Now you have a genuine hybrid push-pull system of generally local
query servers around the Net, with fast responses - a few tens of
milliseconds in most cases, out to 100 to 200 msec in the worst
case for the most remote ITRs - and a single query and response.

This local and massively replicated anycast server system means you
can reduce the caching time of the responses, improving the ability
of ITRs to change their behaviour quickly according to the
end-user's needs.

Around this time, you will probably be thinking . . . if we could
only speed up the rsync system, we could enable the end-users to
have fast enough control of the ITRs to provide Mobility.

You would retrofit the system with a fast-push mapping distribution
system and arrive, by a circuitous path, at a system which supports
mobility and bears some resemblance to Ivip.

My plan is to start with fast push so we can support Mobility from
the outset, and have the mobile folks help pay for the mapping
distribution system!

This series of upgrades to TRRP has the advantage over Ivip in that
you are starting with a completely decentralised set of query
servers, using the ordinary DNS system and software - whereas Ivip
needs more protocol and software development and an up-front
deployment of servers by some sort of consortium to get started.

I like to think this can be done due to the enthusiasm generated by
a system which promises to fix the routing scalability problem,
provide many more end-users with multihoming, portability etc. (and
Ivip's TE options, which are less fancy than those of other
schemes), improve IPv4 address utilization and provide a gutsy,
new, highly valuable kind of mobility.


>> Another critique is that this RIR server is going to be very
>> busy indeed.  You could anycast these /8 level servers, but
>> that would be costly and harder to administer, since they all
>> have to know so much about the authoritative nameservers for up
>> to 64k subdomains.
>> 
>> Broadly speaking, if there are about 220 /8s in use in IPv4,
>> then each such RIR server (really multiple servers) is going to
>> get 1/220 of the total global ITR initial requests for mapping
>> information.
>> 
>> Wouldn't it be more likely that the RIR gave a /12 of this /8
>> to some ISP, and would simply hand back an answer saying to ask
>> the nameserver of that ISP?  Then that server might delegate to
>> another one, which is authoritative.
> 
> Not if it harms performance. At one point the first letter of
> .com domains was reserved against the possibility that they'd
> need to do that sort of delegation but the operators chose
> anycast instead.

Anycast is a good way to scale the system up and reduce delay
times.  It has the disadvantage that problems are harder to detect
from afar - for instance, if one of the anycast sites is
unreachable or not working properly.


>> looking at your IPv6 example in the same light, the figures are
>> more extreme.
>> 
>> Your first asked server is authoritative for a whopping /12 of
>> IPv6 space.  It gives an answer about what server to ask,
>> solving another 36 bits of the ITR's problem:  it tells the ITR
>> the address of another server to ask, which is authoritative
>> for a /48.
>> 
>> Do you expect this /12 server to know about (a theoretical
>> maximum) of 2^36 subdomains?  That is an awful lot.
> 
> That is essentially how the ip6.arpa hierarchy for reverse-DNS is
> organized today. No problems are experienced or forecast.

OK, those /12 servers are going to get a hammering, like the /8 ones
in IPv4.  Anycasting them would spread the load and reduce the
response time - but then you have to push the database to each
anycast site.


>> Overall, I question how your collapsing of what would otherwise
>> be a long series of lookups into two or three will be
>> problematic in terms of:
>> 
>> 1 - The business and therefore trust relationships probably go
>> in steps of fewer bits than the large jumps in bits you use in
>> the examples - raising questions of how the short-prefix
>> nameservers get to be reliably configured with so much detailed
>> information which is actually controlled by so many ISPs or
>> end-users.
> 
> In a PI world, the administrative relationships -are- relatively
> flat:
> 
> IANA-RIR-Org-Individual and IANA-RIR-Individual
> 
> It's only in the PA world we're trying to eliminate that the 
> administrative relationships get deep:
> 
> IANA-RIR-LIR-LIR-LIR-Org-Individual

The optimal way of doing TRRP seems to be with a large number of
anycast query server sites which answer the full DNS query
immediately.  Therefore, the mapping database needs to be
distributed to every such site.  This involves a high degree of
collapsing what is initially conceived of as a delegated, widely
distributed, independently run, large set of nameservers into a
cleanly replicated set of anycast server sites, all dependent on
either a single feed of updates, or multiple feeds from different RIRs.

The technically optimal method involves a high degree of
integration, but perhaps you would work towards that as traffic
volumes rise, starting with a relatively ad-hoc, casual set of
nameservers run by RIRs, ISPs and end-users.

With Ivip, I want to start with a lightweight version of the highly
integrated fast hybrid push-pull system - so from the outset it can
do Mobility and so it can grow to more and more sites for full
database ITRs and query servers, whilst retaining the original
structure.



>> 2 - How this collapsing thwarts the ITR's ability to cache a
>> larger number of nameservers which will actually be used in
>> subsequent requests - thereby requiring it to keep asking the
>> /8 (IPv4) or /12 (IPv6) servers time and again.  The same goes
>> for the world's ITRs and so these servers get a hammering.
> 
> The worst case of this problem is precisely the dot-com zone
> problem which is already solved and deployed at an operations
> level.

Do you have any references on how they do this?  I couldn't easily
find technical information on what software they use for the
nameservers, or how the .com database is maintained and somehow made
accessible to multiple registrars.


>> The real question is whether the delays inherent in an ALT or
>> TRRP based system are enough for at least some marketing folks
>> to trump it up to the broader end-user population as being
>> significant.
> 
> Of course they are. And if it's not the milliseconds it'll be the
>  bytes. You have to transmit more bytes with map-encap so it'll
> cost you more.
> 
> The marketing game is to make sure that at least some people see
> the advantage of selling the new capabilities. "With map-encap,
> we can sell you PI for $20/month. That loser who says map-encap
> is crap won't sell you PI for less than $2000. Two-kay. Twenty
> bucks. You tell me who's full of crap."

Sure - the people without $2k, $8k or whatever have no choice but to
use the map-encap scheme and, with TRRP or ALT, put up with initial
packet delays.

I am aiming for a map-encap scheme without significant initial
packet delays.  Then the existing big-time PI folks would be
encouraged to adopt the new kind of address space management.  Even
though it is their own, already-paid-for, PI space they would be
managing with Ivip or whatever, they would be able to slice and
dice it much finer than the prevailing (and likely to remain so)
IPv4 BGP system can, with its /24 (256 IP address) granularity.

This way, an existing end-user with a /16 who advertises it as
"multiple" longer prefixes can convert their system to an Ivip MAB,
which burdens the BGP system with only a single advertisement.
That reduces the DFZ routing table by "multiple minus 1" routes.
It also encourages new entrants to adopt Ivip-managed address
space, rather than hang out for conventional BGP-managed PI space
"because that is what all the big boys use".

  - Robin

