
Re: [RRG] PMTUD, Sprite & IPTM; Outer src-addr = sending host's addr

To: Routing Research Group list <rrg@psg.com>
Subject: Re: [RRG] PMTUD, Sprite & IPTM; Outer src-addr = sending host's addr
From: Robin Whittle <rw@firstpr.com.au>
Date: Mon, 26 Nov 2007 00:49:28 +1100
Cc: Iljitsch van Beijnum <iljitsch@muada.com>
In-reply-to: <D04722BA-1782-44F9-9744-AC8E9591DC16@muada.com>
Organization: First Principles
References: <4746809F.5020604@firstpr.com.au> <47005293-509C-4E35-8D06-4E0F69A514C1@muada.com> <474792F8.2060006@firstpr.com.au> <D04722BA-1782-44F9-9744-AC8E9591DC16@muada.com>
User-agent: Thunderbird 2.0.0.9 (Windows/20071031)

Hi Iljitsch,

Thanks for your response and offer of the correspondence leading to
Fred's proposal - I won't read that just now.

Hopefully some other folks can scrutinise Fred's proposal and mine:

  http://tools.ietf.org/html/draft-templin-inetmtu-06
  http://www.firstpr.com.au/ip/ivip/pmtud-frag/

ITR-ETR schemes are fresh, and definitely need some kind of PMTUD
system, so it should be fine to build in special functions at both ends.

Thanks for your account of the fragmentation reassembly problems of
IPv4.

> I don't have any references, but in short, the issue is that you
> have a 16 bit ID space with a reassembly timeout of something
> like a few minutes. This means you can only send 65536 packets
> during that "few minute" window or you'll incorrectly reassemble
> fragments from different packets if you lose a fragment. This is
> especially problematic if the fragmented packets belong to a
> tunnel because in that case the IP source/dest addresses are
> always the same.

In a fresh system such as an ITR-ETR scheme, perhaps a workaround
for this would be to set the maximum reassembly time at the ETR to
something very much shorter, such as 1, 2 or 3 seconds?

That could be a pain where the ETR is an additional function in a
server or router with a pre-existing TCP/IP stack.  Then it would
probably not be possible to shorten the time just for the ITR to ETR
packets.
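
As a rough illustration of why a shorter reassembly time helps, here
is a minimal Python sketch. It assumes the only constraint is that the
16 bit Identification field must not wrap within one reassembly
timeout for a given source/destination pair. The 120 second figure is
just a stand-in for "a few minutes", and the function name is
illustrative.

```python
ID_SPACE = 2 ** 16  # the IPv4 Identification field is 16 bits wide


def max_safe_packet_rate(reassembly_timeout_s: float) -> float:
    """Packets per second one src/dst pair can send before an ID value
    can repeat while fragments with that ID may still be held."""
    return ID_SPACE / reassembly_timeout_s


# 120 s is an assumed stand-in for "a few minutes"; 3, 2 and 1 s are
# the shorter ETR timeouts suggested above.
for timeout_s in (120, 3, 2, 1):
    rate = max_safe_packet_rate(timeout_s)
    print(f"{timeout_s:>4} s timeout: {rate:>8.0f} packets/s")
```

Under these assumptions a 2 second timeout permits 32768 packets per
second between one ITR and one ETR, against roughly 546 per second
with a 120 second timeout.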


>>> It also costs you lots of CPU and could even allow for CPU 
>>> exhaustion attacks.
> 
>> Yes, but I think it is better than dropping longish packets
>> just because we assume some too low PMTU of 1280 or whatever,
>> when in fact, within a second or two, the ITR will probably be
>> able to establish that the real PMTU is 1500 or somewhat less.
> 
> Who said anything about preemptively dropping packets? Just send 
> 1500-byte packets + an outer header with DF set and you'll get a
> "too big". After that, you know the path MTU and you can in turn
> send too bigs to the source of the original packets.

OK - here's the Devil's Advocate view:

Firstly, the ITR can't be sure it will get a Packet Too Big (PTB)
message - that could be dropped.

Secondly, there could be another lower MTU limit beyond whatever
sent the first PTB message.

Your algorithm, as I understand it, does not involve the ITR
fragmenting any packets.  All the ITR can do when it gets the PTB
message is send a PTB back to the sending host.  This really slows
things down, since the sending host has to create another shorter
packet and the whole process repeats itself.

With my IPTM approach, as long as the ITR doesn't yet have a
reliable estimate of the PMTU to the ETR (which it takes a few
seconds to establish with an explicit probe protocol, not using
traffic packets, and not relying on PTB messages), it fragments any
packet which is longer (after encapsulation) than the default PMTU,
which I suggest be something like 1280 bytes.  This only persists
for a few seconds while the probing takes place.  After that, the
ITR has a good idea of the PMTU and then sends a PTB to the sending
host when it next gets a packet which would exceed this, once
encapsulated.

In my scheme, when the sending host does get a PTB message, it gets
it nearly instantly from the (typically) nearby ITR, rather than
from the ITR in your scheme with a long delay due to the long path
the encapsulated traffic packet took from the ITR to some router in
the tunnel and the delay of the PTB returning to the ITR.  Maybe you
don't think much of this in Europe or the USA, but from Australia
these delays are significant.

The round trip to your Netherlands based mailserver 83.149.65.1 is
162ms from my web server in San Francisco, and 343ms from a
well-connected (ISP, not via DSL) server in Australia.

So my approach shouldn't involve significant delay in packets,
whereas yours does significantly delay the first "long" packet.

If there is a second shorter MTU limit, the packet which hits that
will be delayed similarly as well.  These packets would be lost if
the PTB message didn't get back to the ITR.
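
A back-of-envelope sketch of that delay, in Python. It assumes each
MTU limit in the tunnel costs at most one ITR-ETR round trip before
the ITR can pass a PTB on to the sending host, and it reuses the two
round trip times quoted above purely as example figures. The function
name is hypothetical.

```python
def worst_case_ptb_delay_ms(itr_etr_rtt_ms: float, mtu_limits: int) -> float:
    """Upper bound on the time spent waiting for PTB messages when each
    MTU limit in the tunnel is discovered by a separate traffic packet.

    Assumption: the router that sends each PTB is no further from the
    ITR than the ETR is, so one ITR-ETR round trip bounds each wait.
    """
    return itr_etr_rtt_ms * mtu_limits


# Example round trip times taken from the figures quoted above.
for label, rtt_ms in (("San Francisco", 162), ("Australia", 343)):
    for limits in (1, 2):
        delay = worst_case_ptb_delay_ms(rtt_ms, limits)
        print(f"{label}: {limits} MTU limit(s) -> up to {delay:.0f} ms")
```

With two MTU limits and a 343ms round trip, the bound is 686ms before
the sending host has the final PMTU value, and that assumes no PTB
message is lost.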

Also, in Ivip, traffic packets from the ITR to the ETR are sent with
the sending host as the outer source address, not the ITR's address.
(So any PTB message would go to the sending host, which would ignore
it, since the destination address is of the ETR, not of the
destination host.)
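
The reason the sending host ignores such a PTB can be sketched in
Python. This is a simplified model, not any real stack's validation
logic: it assumes a host accepts a PTB only if the packet header
quoted inside it names the host as source and a destination the host
is actually sending to. The class, function and addresses
(documentation ranges) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PtbMessage:
    """Only the parts of a Packet Too Big message that matter here."""
    reported_mtu: int
    quoted_src: str  # source address of the packet that was too big
    quoted_dst: str  # destination address of the packet that was too big


def host_accepts_ptb(ptb: PtbMessage, host_addr: str,
                     active_destinations: set[str]) -> bool:
    """Accept only PTBs that quote a packet this host could have sent."""
    return (ptb.quoted_src == host_addr
            and ptb.quoted_dst in active_destinations)


SENDING_HOST = "192.0.2.10"   # hypothetical addresses
DEST_HOST = "198.51.100.20"
ETR = "203.0.113.1"

# A router inside the tunnel quotes the outer header: dst is the ETR.
tunnel_ptb = PtbMessage(1400, SENDING_HOST, ETR)
# A PTB generated by the ITR quotes the original packet instead.
itr_ptb = PtbMessage(1400, SENDING_HOST, DEST_HOST)

print(host_accepts_ptb(tunnel_ptb, SENDING_HOST, {DEST_HOST}))  # False
print(host_accepts_ptb(itr_ptb, SENDING_HOST, {DEST_HOST}))     # True
```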

In Ivip, an ITR couldn't use traffic packets as probes for PMTU to
the ETR - at least in a system which relied on PTB messages.  Using
traffic packets as probes would require some special signalling to
the ETR, in another packet or in a more complex header of the
tunneled packet, to tell the ETR to send an acknowledgement to the
ITR.  I think this complexifies the encapsulation system - so I use
a completely separate probe protocol, between the ITR and the ETR,
which has nothing to do with traffic packets.  I would rather add
probe packets to the total system load than have the lost effort,
delayed delivery and additional complications of using traffic
packets as probes.

With synthetic probes, the ITR can push its luck as far as it likes
with larger packets until one size doesn't generate
acknowledgements.  This is a very robust way of determining PMTU,
without relying on PTB packets - though of course the ITR would use
any which it received.

If you can only use traffic packets to probe the PMTU, you are
restricted to the lengths which result from their length and the
encapsulation overhead.
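
A minimal Python sketch of probing with synthetic packets. Note that
it swaps in a binary search over probe sizes rather than stepping
upwards until a size fails, and it is not the IPTM probe protocol
itself. It assumes a caller-supplied probe_acked(size) function that
sends one probe and reports whether the ETR acknowledged it, that the
floor size always gets through, and that any size below a working size
also works. A real ITR would retry before treating silence as "too
big", since a probe or its acknowledgement can simply be lost.

```python
from typing import Callable


def probe_pmtu(probe_acked: Callable[[int], bool],
               floor: int = 1280, ceiling: int = 1500) -> int:
    """Return the largest size in [floor, ceiling] that is acknowledged.

    floor and ceiling are assumed bounds, not values from any spec.
    """
    lo, hi = floor, ceiling
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always advances
        if probe_acked(mid):
            lo = mid              # mid fits: the PMTU is at least mid
        else:
            hi = mid - 1          # no ack: treat mid as too long
    return lo


# Stand-in for a real path: pretend the tunnel passes up to 1434 bytes.
def simulated_path(size: int) -> bool:
    return size <= 1434


print(probe_pmtu(simulated_path))  # 1434
```

Because the probe sizes are chosen freely, the search can land on the
exact limit, which traffic packets of arbitrary length could not be
relied upon to do.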


My approach is:

   While PMTU to the ETR is being established:

      Fragment traffic packets which (after encapsulation) are
      longer than the default PMTU - even if the DF flag is set,
      since this is only within the tunnel.  This gets the data
      through, in most cases.  Reassembly is in the ETR, so the
      destination host gets one packet.

   Once this has occurred, probe the PMTU to the ETR with a specific
   protocol and artificial probe packets - not with traffic packets.

   Then the ITR can send a reliable PTB message to the source, in
   the likely event that it gets another packet which is too long
   to fit in the tunnel, once encapsulated.

   The result is that all packets are delivered, except the first
   one for which the ITR sends back a PTB message.  Then the sending
   host has been reliably informed of the entire PMTU situation and
   adjusts subsequent packets accordingly.  There is no recursing
   of your algorithm, with delayed delivery in each instance, as
   would be required if there was first a 1450 MTU limit and then
   further along the tunnel a 1400 limit (unrealistic examples).
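
The steps above can be sketched as a per-ETR decision function in
Python. This is only an illustration under stated assumptions: the 20
byte overhead assumes a bare outer IPv4 header, the probing itself is
left out, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DEFAULT_PMTU = 1280   # default suggested above, used until probing ends
ENCAP_OVERHEAD = 20   # assumption: one outer IPv4 header, no options


@dataclass
class TunnelState:
    """What one ITR knows about the path to one ETR."""
    probed_pmtu: Optional[int] = None  # None while probing is under way


def handle_packet(state: TunnelState,
                  packet_len: int) -> Tuple[str, Optional[int]]:
    """Return (action, mtu); mtu is set only for a PTB to the sender."""
    encapsulated_len = packet_len + ENCAP_OVERHEAD
    if state.probed_pmtu is None:
        # No reliable estimate yet: fragment within the tunnel, even
        # if DF is set on the inner packet. The ETR reassembles.
        if encapsulated_len > DEFAULT_PMTU:
            return ("fragment", None)
        return ("forward", None)
    if encapsulated_len > state.probed_pmtu:
        # Reliable estimate: drop and tell the sender what will fit.
        return ("ptb", state.probed_pmtu - ENCAP_OVERHEAD)
    return ("forward", None)


state = TunnelState()
print(handle_packet(state, 1500))  # ('fragment', None)
state.probed_pmtu = 1450           # example result of probing
print(handle_packet(state, 1500))  # ('ptb', 1430)
print(handle_packet(state, 1430))  # ('forward', None)
```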

If the ETR simply forwards packets to the destination host, and
there are PMTU limits between the ETR and the destination host,
these are dealt with by the sending host, not the ETR or ITR.  If
the ETR tunnels to the destination host, then that tunnel mechanism
needs to handle the MTU problems - such as by using Sprite.

The task of my scheme is purely related to the tunnel between one
ITR and one ETR.  The same ETR could have multiple destination
hosts, each with their own MTU limits.


> Yes, this allows for PMTUD black holes, but those are subject to
> the "so don't do that and the problem goes away" doctrine. ISPs
> generally get this, unlike enterprise people and ignorant
> consumers who can't live without their firewalls.

I think an ITR-ETR scheme should minimise the degradation of
communication as much as possible.  I don't think we should let the
ITR-ETR scheme frequently drop packets - otherwise end-users will
quickly discover that this new class of address space sucks.


>> Still, we can predict that there will be such large packets
>> early on in many communications.  Simply dropping them doesn't
>> seem right to me.  Dropping them with a too-low PMTU value
>> being sent to the sending host would screw up that host's later
>> packets, making them shorter than they really need to be.  I
>> think fragmenting them at first is the best approach.
> 
> If we mandate that *TRs support 1500-byte user traffic without 
> fragmentation this wouldn't be any issue in practice.

But then we wouldn't be able to put ITRs and ETRs in all the places
they need to be.  The whole system would be much more restricted in
scope and would not be deployed as widely or as flexibly.


>>>> Later, if more such packets need to be sent, the ITR and
>>>> ETR can work on determining the real PMTU.  I do this with
>>>> probe packets, rather than traffic packets.
> 
>>> Even more overhead...
> 
>> Yes.  However I can't see a way of probing the PMTU in any
>> other way.  ICMP can't be relied upon, and if I tried to use
>> only traffic packets, I would have to risk those packets not
>> arriving.  Instead, IPTM fragments the traffic packets and
>> sends its own probe packets. This means there is no fancy
>> overhead in traffic packets - they are not intended to be used
>> for PMTUD at all.
> 
> I REALLY don't like this: generating signalling traffic when
> there is no data traffic is a very bad precedent. However, we
> probably need to probe for reachability in some way or another,
> if we can do the MTU stuff along with that it may be tolerable.

In my proposal, and I think in Fred's, there are no probe packets
unless the ITR perceives that there are likely to be further longer
packets to handle.  So yes, there would be one or more large probe
packets.  ITRs might evolve some smarts about what length probe
packets to try, perhaps based on comparing notes with nearby ITRs
who have tunneled to this ETR recently, or by some downloaded
"cheatsheet" suggesting good lengths to try, for each of the BGP
prefixes in which ETRs have been found.

Also, an ITR might be happy with getting a response from an ETR to a
relatively long probe packet and leave it at that - rather than send
another somewhat longer packet, and wait a few seconds before
deciding that was too long.  That would lead to longer delays before
it could inform the sending host with a PTB message, prolonging the
time in which the ITR needs to fragment longer traffic packets.

This might lead to a generalised minor underestimate for PMTUs, in
addition to the overhead imposed by tunneling.  However, that might
be preferable to spending a few seconds pushing larger and larger
probe packets at the ETR until one size is perceived as never
generating a response.

Some of these matters are also discussed in the other thread "MTU,
jumboframes, ITR & ETR placement, ITR function in hosts" - please
see my next message.

 - Robin



Follow-Ups:
- RE: [RRG] PMTUD, Sprite & IPTM; Outer src-addr = sending host's addr
  - From: "Templin, Fred L" <Fred.L.Templin@boeing.com>

References:
- [RRG] PMTUD, Sprite & IPTM; Outer src-addr = sending host's addr
  - From: Robin Whittle <rw@firstpr.com.au>
- Re: [RRG] PMTUD, Sprite & IPTM; Outer src-addr = sending host's addr
  - From: Iljitsch van Beijnum <iljitsch@muada.com>
- Re: [RRG] PMTUD, Sprite & IPTM; Outer src-addr = sending host's addr
  - From: Robin Whittle <rw@firstpr.com.au>
- Re: [RRG] PMTUD, Sprite & IPTM; Outer src-addr = sending host's addr
  - From: Iljitsch van Beijnum <iljitsch@muada.com>
