Obviously, this is a dead-end street, and I can only observe how this
community walks to its very end. There, however, the community might
remember that, according to Rekhter's law, there is also a second way.

Heiner
In an email of 18.10.2007 08:10:07 Western European Standard Time,
rw@firstpr.com.au writes:
Hi Dan and Noel,

Dan wrote:
> The issue regarding packet drops with LISP and LISP-CONS has been
> brought up a few times on the list.
>
> Basically, packets are dropped for every ITR cache miss. Since
> CONS mapping requests may take a long time to be satisfied, this
> may result in unacceptable service.
LISP-CONS, LISP-NERD and TRRP use ITRs which cache mapping data and
rely on a query system which spans the Internet to get fresh mapping
data.
I think that dropping packets - or delaying them while queries and
responses cross the planet - is not acceptable for the future
architecture of the Internet.

Each such exchange typically involves multiple packets being sent and
received, with delays and potential losses. Each mapping request could
take a few seconds to resolve under difficult circumstances, and half
a second to one second under common circumstances. Folks in the US and
Europe need to remember that other folks have longer packet paths -
such as 350 ms round-trip delays from Melbourne, Australia to the
Netherlands under close-to-ideal circumstances.
DNS-style mapping lookups (LISP-NERD or TRRP) will often involve
multiple queries to find the right nameserver. This is especially so
for IPv6, with the likelihood of having to work through 48 bits or
more, 4 bits at a time, with TRRP or, I think, LISP-NERD. There's no
way an ITR could cache all those intermediate nameservers, even for
IPv4.
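To put rough numbers on that, here is a toy calculation. The
one-query-per-4-bit-level model and the use of the 350 ms figure are
my own assumptions for illustration, not anything from the TRRP or
NERD drafts:

```python
# Toy model: a DNS-like mapping tree delegated 4 bits (one hex
# nibble) at a time, in the style of ip6.arpa reverse delegation.
# Illustrative assumptions only, not taken from any draft.

def worst_case_queries(prefix_bits, bits_per_level=4):
    """Delegation levels an ITR may have to walk when nothing
    below the root is cached."""
    return prefix_bits // bits_per_level

# Walking 48 bits of an IPv6 address, 4 bits at a time:
print(worst_case_queries(48))                    # -> 12 exchanges

# At ~350 ms per round trip (e.g. Melbourne to the Netherlands),
# a fully uncached resolution could take:
print(worst_case_queries(48) * 0.35, "seconds")  # -> 4.2 seconds
```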
An ITR might be required to drop a packet and query the global system
in these circumstances:

1 - A uses DNS to find the IP address of B, and B's domain name
    requires the query to go to a nameserver which uses a LISP-etc.
    mapped address.

2 - A is on a mapped address too, so for each request to a nameserver
    in the above query, the nameserver's ITR needs to do a LISP/TRRP
    CAR-CDR/DNS-like lookup to find the ETR by which A can be reached.

3 - A's ITR needs to do another mapping lookup before A can send a
    packet to B.

4 - B's response requires another lookup, since A is on a mapped
    address too.
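The drop-on-miss behaviour underlying all four cases can be sketched
in a few lines. The class and method names here are my own
illustration, not from any of the drafts:

```python
# Minimal sketch of a caching ITR that drops packets on a cache
# miss while it queries the global mapping system (the LISP-CONS /
# LISP-NERD / TRRP behaviour discussed above). Names are
# hypothetical.

class CachingItr:
    def __init__(self):
        self.cache = {}          # destination EID -> ETR address
        self.pending = set()     # destinations with queries in flight

    def handle(self, dst):
        etr = self.cache.get(dst)
        if etr is not None:
            return ("tunnel", etr)    # encapsulate toward the ETR
        if dst not in self.pending:
            self.pending.add(dst)     # ask the global query system
        return ("drop", None)         # meanwhile, the packet is lost

    def mapping_reply(self, dst, etr):
        self.cache[dst] = etr         # later packets can be tunneled
        self.pending.discard(dst)

itr = CachingItr()
print(itr.handle("B"))                # -> ('drop', None): 1st packet lost
itr.mapping_reply("B", "etr-b")
print(itr.handle("B"))                # -> ('tunnel', 'etr-b')
```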
Points 1 and 2 can occur in pairs for as many depths of nameserver
recursion as are required to get an answer - for instance, five or so
times to find the IP address of "aa.bb.cc.ee.ff.gg".

At every such query, the LISP or TRRP ITR drops the packet and the
higher layers have to time out and retry. Here is an example of
establishing a TCP link to a host which is identified by a four-level
domain name, www.xxx.com.au.
The client host A is on a LISP-mapped address, as are the nameservers
for .com.au and .xxx.com.au. We assume the client host A has cached
the address of the .au nameserver (this would not be the case for the
less-used TLDs) but not of the nameserver for .com.au. Likewise, we
assume A's LISP ITR initially has no mapping for the address of the
.com.au nameserver.
This is perhaps unrealistic, since we wouldn't expect the .com.au
nameservers to be on LISP-mapped addresses. However, the following
example is valid for a webserver of a department of a company or
university, such as:

   www.astro.someuni.ch
   www.sales.wizmo.es

where these organisations use mapped addresses for their networks.
This is especially the case in the future, when LISP/Ivip-etc.-mapped
address space is likely to be very widely used.
For instance, maybe someuni.ch runs one of its nameservers on its own
LISP-etc.-mapped network and a second one on another university's
LISP-etc.-mapped network - but the astronomy department has a separate
campus, with one of its nameservers on another portion of
LISP-etc.-mapped address space, etc.
A -X-> ns1.com.au     LISP ITR drops the 1st packet.

                      A times out. How long does the first
                      time-out take?

A ---> ns1.com.au     A sends a 2nd packet, which the ITR now has
                      mapping for (ideally - but maybe the ITR is
                      still awaiting a response from the global
                      CAR-CDR or DNS-like query system) and so
                      tunnels it to the ETR which serves
                      ns1.com.au.

A <-X- ns1.com.au     The LISP ITR near ns1.com.au drops the
                      packet - it has no mapping for A's address.

                      A times out again. How long does the second
                      time-out take?

A ---> ns1.com.au     A sends a third packet.

                      Or has A given up on ns1 and tried to send a
                      query to ns2.com.au? This is on a totally
                      different network, and so we REPEAT (not
                      shown here) all the above steps.

                      Now (ideally) the LISP ITR near A has
                      mapping information for nsx.com.au - so
                      finally, A gets its query to nsx.com.au.

A <--- nsx.com.au     Ideally, the nameserver's ITR has mapping
                      for A by now, so the packet is tunneled to
                      A's ETR and reaches A.
Repeat the same stuff (as detailed below) so that A can find out the
address of the nameserver for xxx.com.au.
A -X-> ns1.xxx.com.au   A's 1st packet is dropped by the LISP ITR.

                        A times out.

A ---> ns1.xxx.com.au   A sends a 2nd packet.

A <-X- ns1.xxx.com.au   LISP ITR near ns1.xxx.com.au drops the
                        packet.

                        A times out again.

A ---> ns1.xxx.com.au   A sends a third packet.

A <--- nsx.xxx.com.au   A receives the IP address of
                        www.xxx.com.au.
Now a similar pattern of dropped packets and time-outs occurs while A
establishes a TCP session with the web server.

The web server isn't necessarily on the same LISP-mapped address space
as the nameservers. It could be at a hosting company which uses
LISP-mapped space so it can move its upstream connections to other
ISPs without all its customers having to change their DNS entries for
their web servers.
A -X-> www.xxx.com.au   A's 1st packet is dropped by the LISP ITR.

                        A times out.

A ---> www.xxx.com.au   A sends a 2nd packet. By now, ideally,
                        A's ITR has the mapping data.

A <-X- www.xxx.com.au   LISP ITR near www.xxx.com.au drops the
                        packet.

                        A times out again. www.xxx.com.au has a
                        half-open TCP session dangling . . .

A ---> www.xxx.com.au   A sends a third packet. The webserver
                        half-opens a second TCP session.

A <--- www.xxx.com.au   A gets its TCP acknowledgement and sends
                        its response . . .

A ---> www.xxx.com.au   . . . which (ideally) will be tunneled
                        immediately.
This is arguably worse than some common circumstances, but it is easy
to think of more difficult cases, with a greater recursion of
nameservers. Any packets dropped en route make things still worse.
This is:

   16 packets sent (not counting the flurry of ITR query traffic)
    6 dropped packets
    2 1st time-outs by A's name lookup code
    2 2nd time-outs by A's name lookup code
    1 1st time-out by A's TCP session establishment code
    1 2nd time-out by A's TCP session establishment code
With eFIT-APT or Ivip, there are no dropped packets and we have:

   A ---> ns1.com.au         A <--- nsx.com.au
   A ---> ns1.xxx.com.au     A <--- nsx.xxx.com.au
   A ---> www.xxx.com.au     A <--- www.xxx.com.au
   A ---> www.xxx.com.au

   7 packets sent
   0 dropped packets
   0 time-outs
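The two tallies can be re-derived mechanically. This is just my own
bookkeeping of the example traces, written out as a sanity check:

```python
# Per-exchange accounting for the example above, where both ends
# are behind caching ITRs with cold caches.

def drop_on_miss_exchange(extra=0):
    # 1st packet dropped by A's ITR (time-out), 2nd gets through
    # but the reply is dropped by the far-side ITR (time-out), 3rd
    # packet and its reply succeed: 5 sent, 2 dropped, 2 time-outs.
    return (5 + extra, 2, 2)

def full_db_exchange(extra=0):
    # eFIT-APT/Ivip: query and reply are both tunneled immediately.
    return (2 + extra, 0, 0)

def tally(exchange):
    # Two DNS exchanges plus the TCP handshake, which has one
    # extra packet (A's final acknowledgement).
    legs = [exchange(), exchange(), exchange(extra=1)]
    return tuple(sum(col) for col in zip(*legs))

# (packets sent, dropped packets, time-outs):
print(tally(drop_on_miss_exchange))  # -> (16, 6, 6)
print(tally(full_db_exchange))       # -> (7, 0, 0)
```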
In this example, LISP or TRRP typically makes the user and the
higher-level protocols wait for 6 time-out periods. I think this is
really unacceptable.

The delay would be less with only one level of DNS recursion, and if
only one end of each exchange used LISP-mapped address space.
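In wall-clock terms, those six time-out periods add up quickly. The
retry intervals below are my assumptions for illustration, not figures
from any resolver or TCP specification:

```python
# Assumed back-off schedule (illustrative only): the resolver /
# TCP stack waits 1 s before the first retry and 3 s before the
# second, for each of the three exchanges in the example.
FIRST_TIMEOUT = 1.0   # seconds (assumption)
SECOND_TIMEOUT = 3.0  # seconds (assumption)

exchanges = 3  # ns1.com.au, ns1.xxx.com.au, www.xxx.com.au
added_delay = exchanges * (FIRST_TIMEOUT + SECOND_TIMEOUT)
print(added_delay, "seconds of user-visible delay")  # -> 12.0
```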
If something like LISP or TRRP were introduced, I think people would
soon find that the address space it handles sucks - due to the common
experience of excessive delays in establishing sessions of any kind.
It would be very much harder to convince people to adopt this space if
it means a permanent degradation of their own experience and of the
experience of every person who tries to communicate with them.
> Suggestions have been made to route packets on the old topology in
> the event of ITR cache misses. However, this leads to a major
> incremental deployment issue -- since LISP adopters will still
> need to maintain their routes in the old topology, there would be
> no reduction in the size of the global routing table.
I agree there would be no absolute reduction, but there could, and
should, still be fewer BGP prefixes than if LISP etc. was not
introduced. I discuss this below in my response to Noel.
> I have not seen any other suggestions on how to handle this issue.
> Could this be a fundamental problem with the design, or are there
> other solutions?
It is a fundamental problem with having caching ITRs which can't let a
packet go to another ITR which has the full database, and where the
caching ITRs rely on a global-sized - and therefore slow and
unreliable - query system.
Eliot Lear proposed (RRG messages leading to 264 & 288) that LISP-NERD
could have one or more ITRs well outside the sending host's ISP's
network to catch those packets which are not caught by an ITR in that
network. I think this amounts to the same thing as Ivip's "anycast
ITRs in the core". I think this was meant as a means of making
LISP-NERD incrementally deployable. Perhaps, if the LISP-NERD ITRs
were modified to pass on the packets they couldn't tunnel, this would
mean that the packets would be tunnelled immediately, rather than
dropped.
In that case, LISP-NERD would resemble eFIT-APT and Ivip in having
some caching ITRs which could let packets go through to a
full-database ITR (Default Mapper for eFIT-APT, or ITRD for Ivip) when
the caching ITR didn't have the mapping information. Perhaps, with
Eliot's suggestion, those "full database" ITRs are only full-database
for one BGP prefix's range of destination addresses, and perhaps there
is only one such ITR for each such prefix, rather than necessarily
multiple of them using anycast.
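The pass-through alternative is a one-line change to the drop-on-miss
behaviour: on a cache miss, forward the packet toward a full-database
ITR instead of discarding it. A sketch, with illustrative names of my
own choosing:

```python
# Sketch of a caching ITR that forwards on a cache miss rather
# than dropping (the eFIT-APT "Default Mapper" / Ivip "ITRD"
# pattern discussed above). Names are hypothetical.

FULL_DB_ITR = "default-mapper"   # always able to tunnel the packet

def handle(cache, dst):
    etr = cache.get(dst)
    if etr is not None:
        return ("tunnel", etr)            # direct path, as before
    # No mapping yet: the packet is still delivered, just less
    # directly, while the mapping query proceeds in the background.
    return ("forward", FULL_DB_ITR)

print(handle({}, "B"))              # -> ('forward', 'default-mapper')
print(handle({"B": "etr-b"}, "B"))  # -> ('tunnel', 'etr-b')
```

The point of the sketch is that no packet is ever lost to a cold
cache; the cost is a temporarily longer path through the full-database
ITR.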
As it stands, here is the situation as I understand it. (Please check
later in this thread for corrections.)

                      LISP-CONS  LISP-NERD  eFIT-APT  Ivip    TRRP

  Full DB ITRs?       No         No         Default   ITRD    No
                                            Mappers

  Caching ITRs?       All        All        ITR       ITRC    All
                                                      ITFH

  Local query         No         No         Default   QSD     No
  servers for                               Mappers   QSC
  caching ITRs?

  Caching ITR must    Yes        Yes        No        No      Yes
  drop packets it
  has no mapping
  for?

  Distribution of     Pull       Pull       Push      Push    Pull
  database to         CAR-CDR    DNS-like   slow      fast    DNS-like
  networks with       global     global     via       Repli-  global
  ITRs etc.?          network    network    BGP       cator   network
                                                      system

  Incremental         -          RRG        -         Yes     -
  deployment via                 messages
  "anycast ITRs                  264 & 288
  in the core"?
TRRP also has a method by which some ITRs (depending on how directly
they queried the authoritative DNS-like mapping information servers)
get push notification of changed mapping from the authoritative
mapping servers. However, I can't see how this would scale well.

eFIT-APT and Ivip are the only two proposals in which all packets are
tunneled without delay.
Noel Chiappa wrote, in part:

  DJ> Suggestions have been made to route packets on the old topology
  DJ> in the event of ITR cache misses. However, this leads to a major
  DJ> incremental deployment issue -- since LISP adopters will still
  DJ> need to maintain their routes in the old topology, there would
  DJ> be no reduction in the size of the global routing table.

> Well, it's not just an incremental issue; either i) we always
> maintain the backup routing (in which case we don't get the size
> reductions, as you point out),
I disagree with this prevalent notion that maintaining a prefix in BGP
while that prefix is handled by an ITR-ETR mapping scheme will not
result in reductions in the size of the global BGP routing table.

This would only be true if every such prefix was used to serve a
single end-user network. The major benefit of all these ITR-ETR
schemes is that they can slice and dice the address space much finer
than BGP, which is limited (for all practical purposes in the
foreseeable future) to 256-IP-address chunks in IPv4.
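The slice-and-dice arithmetic is simple but worth making explicit. The
block sizes below are illustrative choices of mine, not anything any
particular proposal mandates:

```python
# One BGP-advertised /24 carries 256 IPv4 addresses. A mapping
# layer that can assign smaller blocks within it serves many more
# end-user networks per routing-table entry.

def networks_per_24(block_size):
    """End-user networks per /24, at a given mapped block size."""
    return 256 // block_size

print(networks_per_24(256))  # -> 1: one BGP /24 per end-user network
print(networks_per_24(16))   # -> 16: mapped /28-sized blocks
print(networks_per_24(4))    # -> 64: mapped /30-sized blocks
```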
If end-user networks all needed more than 128 IP addresses, then it
might be true that the ITR-ETR mapping systems can't help use IPv4
address space more efficiently.
However, I believe there is a large and growing number of end-user
networks which require multihoming and/or address space which doesn't
change when a new ISP is used. ("Serial multihoming", as Iljitsch
wrote - though call this "portable address space", with apologies to
those whose teeth itch at the mention of this phrase.)
On this list a few weeks ago, several other people supported my view
that there is a significant number of end-user networks with smaller
(than 128, I guess) IPv4 address requirements which would be well
served by LISP/eFIT-APT/Ivip/TRRP-mapped space, and which would
contribute to the growth of the global BGP routing table if such an
ITR-ETR mapping system was not introduced.
> or ii) we don't maintain them into the indefinite future, in which
> case the speedup of using the backup routing would not be available
> in the future.
As long as Ivip or whatever achieves benefits by enabling lots more
end-user networks to operate as they require, without each one getting
its own BGP-advertised prefix, then I think there's no need to worry
about maintaining the larger (shorter) prefixes in BGP when they are
mapped by LISP etc.
> But since I think we can make the resolution adequately fast, I
> don't think there's a problem here.
You could probably create a complex combination of caching and genuine
push of update messages to make the CAR-CDR network faster.

The question is whether it would be easier to create a complex and
souped-up CAR-CDR system than to establish a clean - although
ambitious - system to distribute mapping data globally, within a few
seconds, as I am planning with Ivip.

My current description of the Replicator system is broad, so it is
hard to make comparisons of difficulty and cost now. Over the next few
months I hope to write up some more elegant, secure and concrete
details.
  - Robin