Re: [RRG] Sprite & IPTM while PMTU probing is in progress



Hi Fred,

You wrote:

>> I don't clearly understand this.  Iljitsch has not been able to
>> convince me - or anyone else AFAIK - that it will be practical to
>> insist on massive hardware upgrades broad enough for the wide range
>> of locations I think ITRs and ETRs need to be located. (Host to host,
>> Man - Coast to coast. http://www.firstpr.com.au/ip/ivip/tv-ad/)
> 
> Sorry for asking, but do you have a PR department that
> writes this stuff up? :^}

It is more a case of adding some colour to keep me awake.
Routing and addressing is pretty dry, I think.  The LISP folks
may have had the same experience - it seems they were dreaming
of iPhones.


>> Maybe some people envisage a more restricted range, or are more
>> upbeat about the prospects of a global jumboframes and gigabit
>> Ethernet upgrade.
>>
>> For the purposes of discussing sprite-mtu and IPTM, I assume that
>> this upgrade is not practical.
> 
> IMHO, the H/W upgrades will occur independently of any
> tunnel MTU handling. But, the tunnel MTU handling should
> be capable of taking advantage of the larger MTUs when
> they become available.

I agree.  I will try to ensure that IPTM can make use of much
larger MTUs.


>>> Since the ITR cannot know the EMTU_R of the ETR a priori unless
>>> there is some spec that says: "all ETRs MUST configure an EMTU_R
>>> of at least X bytes", the ITR should not simply fragment the
>>> outer packets (or, allow the network to fragment them) since they
>>> could black-hole. 
>>
>> You and most other people on this list have much more experience in
>> these matters than I do, but I don't see why fragmenting packets for
>> a few seconds will lead to a "black hole".
> 
> If the ETR does not have a large-enough EMTU_R, I think
> it would have to drop the packet silently.

Yes, but this doesn't seem pertinent to my critique of sprite-mtu
that it would be better to:

1 - As IPTM does, fragment the outer packet into two smaller
    ones, which will pass through the tunnel OK.  Hmm, this
    embodies an assumption that the outer packet is no bigger
    than 1500 bytes or so, and that all ITRs and ETRs are to
    be located in places where they have at least 1280 (or
    similar) byte MTUs to the DFZ.

2 - As sprite-mtu does, try sending the entire outer packet
    (assuming it fits in the MTU limit of the next hop)
    while also sending a PTB to the SH, with a value
    designed to ensure the SH sends only packets which, once
    encapsulated, will fit within the ITR's current estimate
    of the PMTU to the ETR.  Initially, this estimate is a
    low default value, since the ETR probing is not yet
    complete.

I have been focusing on hosts which assume they can send
packets up to 1500 bytes long, and trying to make the ITR-ETR
system handle these, initially, with fragmentation, but then
use a PTB (or as many as are required before the SH changes
its ways) to get the SH to send somewhat smaller packets.

I need to think more about the future, where hosts might
send really large packets, many times bigger than 1500 bytes,
while the ITR still has to assume a figure of ~1280 for the
MTU to an ETR it hasn't yet probed.

When fragmentation means many more than 2 fragments, your
objections to fragmentation make more sense to me.

Here I will consider jumboframes to be 16114 bytes, according
to Iljitsch's: http://psg.com/lists/rrg/2007/msg00628.html .
Jumboframes could be longer, which is more difficult still.

With IPTM as it stands and a jumboframe-inclined SH, which
sends ~16114 byte packets, the ITR would fragment it into 11
packets.  This greatly increases inefficiencies and the chance
of one being lost - assuming the PMTU to the ETR, and the ETR
itself, was ready for (16114 + ENCAPS) byte packets.
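
The exact fragment count depends on which MTU the ITR fragments
to.  Here is a minimal Python sketch of the arithmetic, assuming
20 byte IPv4 headers with no options, ENCAPS = 20, and the usual
rule that every fragment except the last carries a multiple of 8
payload bytes.  (fragment_count is just a name for this
illustration, not part of any draft.)

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options


def fragment_count(inner_len, encaps, mtu):
    """Number of IPv4 fragments needed for one encapsulated packet.

    Assumes the outer packet is inner_len + encaps bytes, of which
    IPV4_HEADER bytes are the outer IPv4 header that is repeated in
    every fragment.
    """
    outer_len = inner_len + encaps
    if outer_len <= mtu:
        return 1
    payload = outer_len - IPV4_HEADER
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = ((mtu - IPV4_HEADER) // 8) * 8
    return math.ceil(payload / per_fragment)


for mtu in (1280, 1500):
    print(mtu, fragment_count(16114, 20, mtu))
```

Under these assumptions a 16114 byte inner packet needs 11
fragments at a 1500 byte MTU and 13 at the 1280 default, while a
1500 byte inner packet needs only 2 at 1280.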

My primary goal is to handle existing hosts - which are assumed
generally not to be jumbo-inclined, and which I guess generally
put out packets of ~1500 bytes, according to their local link
MTUs - in an environment with widespread ITR-ETR use, over links
which may have slightly sub-1500 MTUs themselves, causing the
ITR-ETR tunneling overhead to limit the payload size to some
value ~20 to ~100 bytes below 1500.

I think IPTM does a better job of this than sprite-mtu - because
I think IPTM's fragmenting of larger packets into two is a better
course of action than sending a PTB (and also sending the single
large outer packet).

A secondary goal - which will be really important in the long
run - is to craft IPTM so it works well in a world where
many links, but not all, handle jumboframes, and where hosts
tend to send jumboframes.

One approach is to do something to IPTM in a fixed way to
create happy outcomes in both scenarios.  Another is to
make some requirement of hosts in the future which are
inclined to send jumboframes - I would rather avoid this.

Another is to plan for some overall change to the behavior
of IPTM devices in the future.  This is assuming that there
isn't a single way of optimising for both scenarios,
so at some point, when it is worthwhile, we switch over to
an alternative mode of operation, different config settings
etc. which is worse for the old scenario, but better for
the jumboframes scenario.


>> I recognise that fragmentation involves several costs:
>>
>> 1 - More work for the routers at each end.
> 
> That brings up another question; were you expecting to
> fragment the packet in the ITR's stack before sending,
> or just send with DF=0 and let the network fragment?
> The literature and recent discussion consensus seems
> to greatly favor the former.

Hmm ... I was intending the ITR to fragment the packet.

This was to avoid a situation where there are two or more
MTU bottlenecks in the path:

  ITR sends 1 x 6000 byte packet.

  R1 (seeing MTU to R2 is 3000 bytes) fragments this to
  3 x 2020 ((6000 / 3) + 20) byte packets.

  R2 (seeing MTU to R3 is 1500 bytes) fragments these to
  6 x 1030 ((2020 / 2) + 20) byte packets.

IPTM would have assumed a ~1280 byte PMTU, until it had
completed probing the ETR, so it would have fragmented
the packet to 5 x 1220 byte packets ((6000 / 5) + 20).

This is marginally more efficient than the above, except
for where the single packet can travel most of the distance
before fragmentation.

Thanks for pointing this out.
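
The round figures above split each packet evenly.  As I
understand the example procedure in RFC 791, a router instead
fills each fragment up to the link MTU, in 8 byte units, and
leaves a short final fragment.  A minimal Python sketch of that
arithmetic, assuming 20 byte headers with no options (fragment
and through_path are names made up for this illustration):

```python
IPV4_HEADER = 20  # bytes, assuming no IP options


def fragment(total_len, mtu):
    """Split one IPv4 packet (header included) into fragment lengths."""
    if total_len <= mtu:
        return [total_len]
    payload = total_len - IPV4_HEADER
    step = ((mtu - IPV4_HEADER) // 8) * 8  # non-final fragments: 8-byte units
    sizes = []
    while payload > 0:
        chunk = min(step, payload)
        sizes.append(chunk + IPV4_HEADER)
        payload -= chunk
    return sizes


def through_path(total_len, mtus):
    """Fragment lengths after crossing links with the given MTUs in order."""
    packets = [total_len]
    for mtu in mtus:
        packets = [frag for p in packets for frag in fragment(p, mtu)]
    return packets


in_network = through_path(6000, [3000, 1500])
at_itr = fragment(6000, 1280)
print(len(in_network), in_network)
print(len(at_itr), at_itr)
```

Under these assumptions the in-network case ends up as 7 packets,
three of them very short, while fragmenting once at the ITR to
1280 gives 5 packets.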


>> 2 - Two packets rather than one being handled by all
>>     routers en-route.
> 
> Or more than 2, if the network also fragments.

Yes, but for packets which start off around 1500 bytes, and if
we make a "1280 PMTU to the DFZ" rule for placing all ITRs and
ETRs, we can assume that the packets will only be fragmented
once into two.


>> 3 - More data being sent, due to an extra packet's worth of
>>     overhead.
>>
>> 4 - Most seriously, I think, a greater chance of loss of the total
>>     packet due to its delivery depending on two packets, rather than
>>     one.
>>
>>     Each packet has some risk of being lost, and the risk is
>>     a little higher than would otherwise be the case, since the
>>     two packets take up more time on the wire and more router
>>     resources than the single original packet.
>>
>> 5 - Fragmentation failures due to 16 bit ID wraparound.  You
>>     mentioned this being a problem with IPv4 and high packet rates
>>     with longer delays, causing the wrong packets to be reassembled,
>>     due to the 16 bit counter  wrapping and them having the same
>>     sequence number.
>>
>> Tackling point 5 first, in message 608, quoting Iljitsch, I asked
>> whether this could be largely resolved by a shorter time window for
>> reassembly in the ETR:
>>
>>  - - - -
>>
>>> I don't have any references, but in short, the issue is that you
>>> have a 16 bit ID space with a reassembly timeout of something
>>> like a few minutes. This means you can only send 65536 packets
>>> during that "few minute" window or you'll incorrectly reassemble
>>> fragments from different packets if you lose a fragment. This is
>>> especially problematic if the fragmented packets belong to a
>>> tunnel because in that case the IP source/dest addresses are
>>> always the same.
>>
>> In a fresh system such as an ITR-ETR scheme, perhaps a workaround
>> for this would be to set the maximum reassembly time at the ETR to
>> something very much shorter, such as 1, 2 or 3 seconds?
>>
>> That could be a pain where the ETR is an additional function in a
>> server or router with a pre-existing TCP/IP stack.  Then it would
>> probably not be possible to shorten the time just for the ITR to ETR
>> packets.
>>
>>  - - - -
>>
>> Can anyone comment on the prospects of this being an acceptable
>> solution?
> 
> During some of the discussions (can't recall whether it
> was on- or off-list) some Linux code was shown that uses an
> out-of-order upper bound in terms of number of reassemblies
> outstanding instead of a shortened timer when deciding whether
> to purge an incomplete reassembly. That seemed like the right
> way to go to me.

OK.  This sounds easier to implement than timers.  Hopefully
someone can remember some information about this.
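
To make sure I understand the idea, here is a rough Python
sketch.  This is not the Linux code mentioned above, only a guess
at the general approach: incomplete reassemblies are purged when
too many newer ones are outstanding, rather than when a timer
expires.  The class name, the key layout and the max_outstanding
parameter (assumed to be at least 1) are all hypothetical.

```python
from collections import OrderedDict


class ReassemblyCache:
    """Reassembly queues bounded by count instead of by a short timer.

    Hypothetical sketch: when more than max_outstanding datagrams are
    being reassembled, the oldest incomplete one is purged.
    """

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding  # assumed >= 1
        self.queues = OrderedDict()  # key -> {offset: (data, more_fragments)}

    def add(self, key, offset, data, more_fragments):
        """Store one fragment; return the full payload once complete."""
        if key not in self.queues:
            self.queues[key] = {}
            if len(self.queues) > self.max_outstanding:
                self.queues.popitem(last=False)  # purge the oldest queue
        self.queues[key][offset] = (data, more_fragments)
        payload = self._assemble(self.queues[key])
        if payload is not None:
            del self.queues[key]
        return payload

    @staticmethod
    def _assemble(frags):
        """Return the payload if fragments are contiguous and the last is present."""
        expected = 0
        parts = []
        for offset in sorted(frags):
            data, more = frags[offset]
            if offset != expected:
                return None  # gap: a fragment is still missing
            parts.append(data)
            expected += len(data)
            if not more:
                return b"".join(parts)
        return None
```

The key would be something like (outer source, outer destination,
protocol, ip_id).  With max_outstanding = 2, the first fragment of
a third datagram purges the oldest incomplete one, so a late
fragment with a reused ID cannot be joined to it.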



> Setting a shorter timer might not be such a good idea for
> reasons outlined in RFC1122, Section 3.3.2. In some earlier
> efforts, I delved into the RFC1122 suggestion of managing
> round-trip times: 
> 
>   + It has been suggested that a cache might be kept of
>   + round-trip times measured by transport protocols for
>   + various destinations, and that these values might be used
>   + to dynamically determine a reasonable reassembly timeout
>   + value.  Further investigation of this approach is
>   + required.
> 
> My conclusion was that a protocol between the ITR and ETR
> could be devised to do this, but it was far too complicated
> for practical purposes and had too many things that could
> go wrong.

It is just the ETR which has to be changed.  This raises
questions of whether the ETR is some special new device
(unlikely) or a conventional PC-like host, which would
typically rely on standard OS packet reassembly, which
could be hard to alter.  A server specifically programmed
to be an ETR might have tweaked networking code in its OS
for this purpose.

Routers may or may not be amenable to tweaking their packet
reassembly operations just because the router is functioning,
in part, as an ETR.
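
To put rough numbers on the ID wraparound problem and on the
shorter reassembly time I suggested, here is a small Python
sketch.  It assumes every packet between one ITR-ETR pair takes
the next value from a single 16 bit ID counter, at a steady rate.
The function names are made up for this illustration.

```python
ID_SPACE = 2 ** 16  # distinct values of the IPv4 Identification field


def id_wrap_seconds(bits_per_second, packet_bytes):
    """Seconds until the 16-bit ID wraps at a steady packet rate."""
    packets_per_second = bits_per_second / (packet_bytes * 8)
    return ID_SPACE / packets_per_second


def max_safe_pps(reassembly_timeout_s):
    """Highest packet rate that cannot reuse an ID within the timeout."""
    return ID_SPACE / reassembly_timeout_s


print(round(id_wrap_seconds(1e9, 1500), 3))
print(max_safe_pps(2))
```

Under these assumptions, 1 Gbps of 1500 byte packets wraps the ID
in about 0.79 seconds, and a 2 second reassembly time is only safe
up to 32768 packets per second, so a shorter timer alone may not
be enough at high rates.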


>> In an ITR-ETR scheme, we probably don't want to sit around waiting
>> for packets which somehow get lost and arrive more than a second
>> late via some circuitous route through Manangatang.
> 
> There is also the consideration of whether we want to make
> this delay-tolerant, e.g., for multiple satellite hops,
> interplanetary communications, etc.
> 
>> I will read:
>>
>>   IPv4 Reassembly Errors at High Data Rates
>>   http://tools.ietf.org/html/rfc4963
> 
> OK.
> 
>> The other points don't, to me, constitute a "black hole" or a
>> serious problem.  They seem to involve marginally greater efforts,
>> marginally less efficiency and marginally greater chances of the
>> final packet not being delivered.
>>
>> To me, this is a far better thing to do, for a few seconds or so
>> into a potentially long flow, than dropping packets.
>>
>>
>>> There are also other factors to consider, including that the ITR
>>> may not have ultimate control over the setting of the ip_id. And,
>>> the ETR may not be able to receive non-initial fragments in the
>>> first place. (These factors can be mitigated by the placement of
>>> the ITR and ETR in some use cases, however.)
>>
>> I don't clearly understand these points.  Can you explain 
>> them further?
> 
> If the ITR is located behind a NAT or translating firewall,
> the ip_id could be re-written. I believe a common approach
> is to rewrite with a randomly-chosen 16-bit value. When
> there are many SH's behind the same NAT/firewall and using
> the same ITR, this could present a problem.

In Ivip, ITRs are never behind NAT.  I assume the same is
true of other ITR-ETR schemes.

> About non-initial fragments, I am told that some NATs/firewalls
> simply pass the first fragment and drop all others. Problem
> being that non-initial fragments do not include the transport
> layer header.

I don't fully understand this, but I think it doesn't matter
if we know ITRs are never behind NAT etc.


>>>> A "long" inner packet is one which, once encapsulated, would
>>>> exceed the ITR's current best estimate of the PMTU.  This would
>>>> initially be a default such as 1280 bytes.
>>>>
>>>> If this default value is above the minimum for the protocol,
>>>> eg. 576 for IPv4, then this value of PMTU to the "core" of the
>>>> net must be available to every ITR and ETR and would be part of
>>>> the specification of the ITR-ETR scheme.
>>>
>>> The pathMTU cannot be known a priori; all that can be known a
>>> priori is the EMTU_R of the ETR if there is a specification for
>>> the minimum size.
>>
>> Yes.  I envisage some value around 1280 - at least something so that
>> a 1500, 1520 etc. byte packet would only be fragmented into 
>> two packets.
> 
> Unless some new spec comes along, the only values for
> EMTU_R that can be assumed are 576 bytes for IPv4 and
> 1500 bytes for IPv6.

The new specification would be part of the ITR-ETR scheme's RFC.
For instance:

   ITRs and ETRs must be located so that their PMTU to the
   DFZ is at least 1280 bytes.

The number would need to be carefully chosen.  I use 1280
because it is substantially below 1500 - and so able to
accommodate several layers of tunneling.

The trick would be to make the number reasonably high, to
reduce the number of initial packets the IPTM ITR would need
to fragment, or worry about at all in terms of PMTU, without
unnecessarily restricting the location of ITRs and ETRs.


>>>> The default value would be replaced by a higher value once the
>>>> probe process was complete.  The pattern would be something
>>>> like this, assuming the SH's initial idea of PMTU to the
>>>> Destination Host was 1500.  I will assume an ENCAPS overhead of
>>>> 20 bytes (as with IPv4 Ivip, though other ITR-ETR schemes have
>>>> higher overheads) and that all ITRs and ETRs are located so
>>>> they have an MTU of at least 1280 from the DFZ.
>>>
>>> Do you mean to say "pathMTU" or "linkMTU"? In terms of "pathMTU",
>>> do we need to consider links with configurable linkMTUs that
>>> might have either mis-configured or overly-conservative values?
>>
>> I don't understand this clearly enough to respond at present.
> 
> May not have been worded very well. A while back, I think
> it was Iljitsch who indirectly pointed out that any link
> for which an MTU can be manually configured can also be
> misconfigured. For example, what if an operator means to
> set a linkMTU of 4500, but his finger slips and he ends
> up setting only 450? Flows that use that link would
> experience the inefficiency - but, they should still work. 

I don't think a major architectural addition like an ITR-ETR
scheme with its sprite-mtu or IPTM or whatever PMTU management
system should be required to cope gracefully with such
misconfiguration.


>>    Inner      ITR action on           SH's idea  DH gets
>>    packet     outer packet            of PMTU
>>    length     following encapsul-     to DH
>>               ation of inner packet
>>
>>                                       1500
>>
>>
>>    200       Send outer packet -      1500       The packet
>>              the length is less than
>>              1280.
>>
>>   1500       Fragment packet and      1500       The packet
>>              commence probing                    (Less efficient
>>                                                  and more error-
>>                                                  prone tunnel with
>>                                                  2 packets instead
>>                                                  of 1, but this is
>>                                                  only for a few
>>                                                  seconds, I hope.)
>>
>>   1500       Fragment packet and      1500       The packet
>>              continue probing
>>
>>              ... etc.
>>
>>> This could black hole and appear as congestion-related loss to
>>> the SH.
>>
>> As noted above, I foresee only marginal efficiency and reliability
>> impact, where you see unacceptable packet losses.
> 
> It's not so much the loss I am concerned with; IMHO,
> sustained and unmitigated fragmentation is dangerous
> even for short periods of time.

My feeling - and you have much more experience in this than I do -
is that short bursts of fragmentation into two packets are a better
choice than instantly telling the sending host to send much shorter
packets than it probably can send.

But this rests on my assumption that the SH is only putting out
packets about 1500 bytes long.  It doesn't hold if hosts are pushing out
jumboframe packets over 2000 bytes or so.


>>              Probing complete:
>>              PMTU to ETR decided
>>              to be 1460.
>>
>>   1400       Send outer packet -      1500       The packet
>>              the length is <= 1460.
>>              (This length would not
>>              necessarily be sent - it
>>              is just to show that the
>>              ITR will now send longer
>>              packets without frag-
>>              mentation than before.)
>>
>>   1500       Drop the packet and send
>>              the SH a PTB message
>>              with value 1440.         1440       Nothing, but the
>>                                                  ITR is usually
>>                                                  close to the SH,
>>                                                  and it doesn't
>>                                                  take long for...
>>
>>> Depending on the placement of the ITR, this PTB might not make it
>>> back to the SH.
>>
>> If there is a filter blocking PTB messages from the ITR to the SH,
>> then I think ordinary, non-RFC4821, PMTUD would be clobbered anyway
>> - with or without an ITR-ETR scheme.  This situation would not
>> persist, I think, when an ITR-ETR scheme was widely deployed, as I
>> discussed in:
>>
>>   http://psg.com/lists/rrg/2007/msg00636.html
> 
> See RFC2923 for a discussion of PTB filtering implications
> for classical PMTUD. 

I think I have read this, but I will read it again.


>>   1440      Send the outer packet -  1440        The packet
>>             the length is <= 1460                (Now the tunnel
>>                                                  is handling
>>                                                  optimal length
>>                                                  packets.)
>>
>>>> This pattern would continue unless the ITR, with periodic
>>>> probing, decides that the PMTU is less than 1460 (it might do
>>>> this quickly if it got a PTB message from a router in a new,
>>>> more MTU-challenged, path to the ETR), and if the SH sends a
>>>> packet which would be too big for the new lower value of PMTU.
>>>> Then the ITR would send another PTB message to the SH,
>>>> with a lower value than 1440.
>>>
>>> There is not strictly any periodic probing needed to detect 
>>> pathMTU reductions, since the data packets serve as virtual 
>>> probes. The data packets will be lost and might be considered by
>>> the SH as congestion-related loss if the PTB can't be translated
>>> by the ITR and sent back to the SH. 
>>
>> In my IPTM proposal, the ITR would only get PTB messages from
>> routers in the tunnel if the outer source address was that of the
>> ITR.  This would be the case if IPTM was applied to LISP, eFIT-APT
>> or TRRP.  Ivip uses the sending host's address in the outer header,
>> so the ITR would never get a PTB message if the PMTU to the ETR
>> suddenly became lower than what it had assumed, based on previous
>> explicit probes acknowledged by the ETR.
> 
> I believe this represents a departure from some of the other
> proposals. 

It is, for reasons including:

  1 - Making it relatively easy for the ETR to enforce the
      ISP's filtering which rejects incoming packets with
      source addresses from its own network - extending
      this protection to the inner packet, by refusing to
      forward to the destination host any packet where the
      inner source address is different from the outer
      source address.

         http://tools.ietf.org/html/draft-whittle-ivip-arch-00#section-14.1

  2 - Enabling the SH to perform traceroute, with a somewhat
      modified traceroute program.

but this choice of outer source address = sending host address is
also tied in with some other decisions in which Ivip is different
from the other proposals to date.


> Are you suggesting that the SH will always be
> globally addressable from within the core? So, there is no
> locator/id split? No public addressing for the ITR and
> private addressing for the SH? No IP version mismatches
> between the SH and ITR?

The SH may be behind NAT, but since the ITR is between the NAT and
the DFZ, the ITR regards the SH's address as that of the NAT box.

SHs can be on ordinary addresses, receiving packets directly via
the current BGP system, or they can be on Ivip-mapped addresses,
receiving packets via ITRs and an ETR.  In Ivip, the mapped
address blocks are all advertised in BGP, but with an "anycast"
arrangement where multiple (hundreds of thousands, for instance)
ITRDs (full database ITRs) advertise each mapped address block,
causing packets from sending hosts to go to the nearest such ITRD.

I don't think of Ivip as a locator-ID separation protocol.  It
probably is, but I see it as more of an additional system of
plumbing - to do a much finer, and more rapidly flexible in time,
job of splitting up the address space and getting packets to
ETRs near the destination hosts than is practical with BGP.


> In any event, assuming that the outer source is that of
> the SH (and not the ITR), then the PTBs coming from within
> the tunnel and delivered directly to the SH will forever
> report a too-large MTU because the SH has no way of knowing
> that an ITR on the path will be inserting ENCAPS overhead.
> Even worse, the data included in the PTB will include the
> ENCAPS inserted by the ITR and will be unrecognizable
> to the SH.

Yes, so a correctly implemented SH will ignore them - so
these PTBs will cause no trouble.

The SH needs to get a PTB from the ITR, which is what IPTM
is intended to achieve.
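
The ITR behaviour in my earlier example table can be sketched in
a few lines of Python.  This is only an illustration under the
same assumptions as the table (ENCAPS of 20, a 1280 default until
the ETR has been probed), and the names EtrState and itr_action
are made up for the purpose.

```python
from dataclasses import dataclass

ENCAPS = 20          # assumed encapsulation overhead (IPv4 Ivip)
DEFAULT_PMTU = 1280  # assumed default before an ETR has been probed


@dataclass
class EtrState:
    pmtu: int = DEFAULT_PMTU
    probed: bool = False


def itr_action(inner_len, etr):
    """Return (action, ptb_value) for one inner packet, IPTM style."""
    outer_len = inner_len + ENCAPS
    if outer_len <= etr.pmtu:
        return ("send", None)
    if not etr.probed:
        # Still on the default estimate: fragment and keep probing.
        return ("fragment_and_probe", None)
    # Probing is complete: drop and tell the SH the largest inner packet.
    return ("drop_and_ptb", etr.pmtu - ENCAPS)
```

With a fresh EtrState() a 1500 byte inner packet is fragmented.
Once probing has set pmtu to 1460, the same packet is dropped with
a PTB value of 1440, and a 1440 byte packet is sent whole.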

>> This would cause a black hole for all packets which, once
>> encapsulated, were longer than the new, lower, PMTU between the ITR
>> and ETR.  To guard against this, the ITR needs to periodically probe
>> an ETR to which it is continually sending long packets.  There could
>> be other techniques, such as an ITR-ETR protocol which enables the
>> ITR to receive acknowledgement of packets it sends, but this gets
>> pretty complex, and I am hoping to avoid such things.
> 
> Unless the period for probing is very short, there is
> opportunity for a self-sustaining denial-of-service to
> the SH. 

I don't see how there could be a DoS attack as such.

Unless I add something to IPTM to have the ITR receive
acknowledgements, say every 10 to 60 seconds, from the ETR,
then the ITR has no way of knowing that the tunnel is
dropping packets, due to the PMTU to the ETR becoming
lower than the value discovered during probing.

Another approach is to have more frequent ITR probing of
the ETR, as you suggest.

I intend IPTM to be usable by other schemes than Ivip.  For
those schemes, the IPTM ITR would usually be able to detect
that the tunnel PMTU had been reduced, since the ITR would
get the PTB messages from one of the routers in the tunnel.

However, this is not robust, because there could be a filter
between the ETR and ITR which might prevent the PTB packets
from reaching the ITR.

IPTM for Ivip doesn't care about any filtering of PTB
messages between the ITR and the ETR.  It relies entirely
on explicit probes from the ITR to the ETR.


>>> But, the ITR will be able to return the correct PTB when the SH
>>> retransmits. This is the same as for sprite-mtu.
>>
>> Once the ITR has a correct idea of the new, lower, PMTU to the ETR,
>> it can drop (IPTM) or send (sprite-mtu) the packet, and generate a
>> PTB to the SH when the SH next sends a packet which, once
>> encapsulated, would be too long for that new PMTU.
> 
> This loses out on the opportunity to use the packets
> sent into the tunnel as probes to detect MTU increases.
> And, if you are expecting your ITR to periodically
> re-probe to detect MTU increases in time to satisfy
> MTU-probing SH, it is going to require excessive probing
> overhead and may not get the job done in time to avoid
> black-holing. 

I am not too fussed about how rapidly the ITR detects an
increase in PMTU to the ETR.  Delay in discovering this only
means a period of lost opportunity for increased efficiency.

Delay in discovering a shorter PMTU is a much more serious
problem, because that leads to black holes of complete
packet loss.

Of course, if RFC4821 is widely implemented, the SH will
discover the packet loss promptly and try shorter packets.

I think RFC4821 is basically a good thing.  IPTM doesn't
get in its way, and in this particular packet loss situation
which IPTM as currently defined would create, RFC4821 would
make this limitation of IPTM not much of a problem.

However, IPTM doesn't require or (I think) otherwise benefit
from RFC4821.


>>>> Alternatively, occasional probing by the ITR might discover a
>>>> higher value of PMTU to this ETR, and the SH could discover
>>>> this increase by trying its luck with a larger packet - and
>>>> either having it accepted, or rejected with a PTB containing
>>>> the new higher value, minus 20.
>>>
>>> SHs that don't implement RFC4821 will have to wait for a long
>>> time before trying a larger packet (RFCs 1191 and 1981 say 10min,
>>> I believe). 
>>
>> Indeed.  This is why my critique of sprite-mtu seems to be
>> important.  As I understand it, the SH first gets a rather low value
>> in the PTB message from the ITR, at a part of the exchange at which
>> an IPTM ITR would have simply fragmented the packet without any PTB,
>> and commenced or continued probing the ETR.
> 
> That depends on what you consider to be "low". Depending
> on the placement of the ITR, "low" will only go as low
> as 1280 but may go as large as, e.g., 1480. 

Ignoring for the moment IPTM's current difficulty in coping
nicely with SHs and tunnel paths which handle jumboframes, my
critique of sprite-mtu is one of the relative efficiency,
between, for instance:

 1 - IPTM fragmenting the first long packets while it probes
     the ETR, and then settling down to a nice ca. 1460
     packet length from the SH, after the ITR sends a PTB
     message to it.

 2 - Sprite-mtu, as I understand it, first sending a PTB of
     ca. 1260, and the SH using this for the next 10 minutes
     before it is allowed to try sending a longer packet.

How high can we make this ~1280 default figure?  The higher
the better, as long as we don't unnecessarily restrict the
placement of ITRs and ETRs.

However, since many or most flows of information last less
than 10 minutes, and since most SHs (in my view) will not
be using RFC4821, this means that with sprite-mtu, most
traffic would be stuck with the default value MTU, rather
than whatever higher value the ITR discovers with a few
seconds of probing.
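
As a rough indication of what being stuck at the default costs,
here is a small Python sketch which counts the packets needed for
a transfer at the two SH packet lengths from my example (1260 and
1440).  It assumes 40 bytes of TCP/IPv4 headers with no options,
and packets_needed is a name made up for this illustration.

```python
import math

TCP_IP_HEADERS = 40  # assumed: 20 byte IPv4 + 20 byte TCP, no options


def packets_needed(transfer_bytes, packet_len):
    """Packets the SH must send to move transfer_bytes of payload."""
    payload_per_packet = packet_len - TCP_IP_HEADERS
    return math.ceil(transfer_bytes / payload_per_packet)


for packet_len in (1260, 1440):
    print(packet_len, packets_needed(1_000_000, packet_len))
```

Under these assumptions a 1,000,000 byte transfer takes 820
packets at 1260 bytes and 715 at 1440 bytes.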

I think sprite-mtu only works well if the SH uses RFC4821.

An RFC4821 SH won't be negatively affected by the ITR both
forwarding its packet and sending it a PTB message (with a PMTU
value of ca. 1260 or so, as per my example - or perhaps
(576 - ENCAPS)), because if the packet it sent was not too long
for the tunnel (once encapsulated) then it would get end-to-end
RFC4821 confirmation of delivery, and so ignore the PTB message.

If it took notice of the initial PTB message(s), an RFC4821
SH is still able to fight its way out of this inefficient
situation (sending packets of ~1260 or less bytes) by
trying a longer packet length in a shorter time than allowed
for non RFC4821 compliant hosts.


>> If it takes at least 10 minutes for a non-RFC4821 compliant host to
>> try sending larger packets, then this is a long time for the
>> communication to be restricted to the shorter packets.
> 
> SHs are therefore advised to begin implementing RFC4821.
> Deployment is incremental and involves only the SH.

I think this makes the acceptable behavior of the ITR-ETR
scheme (at least in terms of solving the PMTUD problem
which bedevils every such scheme) dependent on host changes.

This is at odds with my vision of incremental deployability.

I wrote about why I think RFC4821 is a very demanding host
change which I think is unlikely to be widely adopted in
the time frame in which an ITR-ETR scheme needs to be
introduced, which I see as being within 5 years.

Does anyone argue that RFC4821 adoption in desktops and
servers is actually going to be widespread in the next
5 or 10 years?  I know it is a new RFC, but what is the
current status of implementation, in operating systems
(TCP) and in applications?  Applications can only do it
if the OS supports it.


>> I am not assuming widespread adoption of RFC4821 at any time.  It
>> looks really complex to implement, involving applications and the
>> TCP layer communicating with a new function in the OS in ways which
>> were not originally part of the protocol stack.   Writing all this
>> code, for marginal immediate benefit, and then trying to debug it in
>> all its possible combinations of applications, live network settings
>> etc. sounds really, really, complex.
> 
> I disagree; active end-to-end involvement in MTU determination
> is important for the long term.

I agree it is important in the long term, but is it
really going to happen in the next few years?

Sufficiently for it to be widely enough deployed that
those SHs which don't have it and therefore don't work
well with sprite-mtu, will experience persistent PMTU
difficulties as a result of the introduction of the
ITR-ETR scheme?

If there are a significant number of hosts (maybe as
few as 10% or less) which are not up-to-speed with
RFC4821, in the early days of introducing an ITR-ETR
scheme (I hope 2010 or 2011), then I fear this would
have the effect of making addresses mapped by
the ITR-ETR scheme suck - which would create an
impenetrable barrier to the introduction of the scheme,
and therefore doom our beloved Internet to eternal
twilight and doom.  (Or boost the fortunes of Cisco
et al. as everyone buys new routers to cope with
millions of DFZ routes - the demand for IPv4 space
in smaller increments is sure to drive growth in
advertised prefixes up precipitously in the next
5 to 10 years.)

...

>>>> It seems strange to me to send the packet (unfragmented, I
>>>> assume) while also sending back a PTB message to the sending
>>>> host.  Wouldn't this cause needless traffic and/or confusing
>>>> signals to the SH if the outer packet does in fact arrive at
>>>> the ETR and therefore the inner packet is delivered to the
>>>> destination host?
>>>
>>> To the SH, it would appear that there is a router on the path
>>> returning inaccurate information. This can happen already today,
>>> since routers can be misconfigured, and spoofed PTBs can be sent
>>> from any node in the network.
>>
>> It still seems strange, confusing and inefficient to me.
> 
> I disagree; there is value in sending the packet into the
> tunnel on at least three levels: 1) it serves as a virtual
> probe so that the ITR can detect MTU restrictions further
> down the tunnel, 2) the packet may be an MTU probe of the
> SH, 3) Packet delivery ratio may benefit in some use cases. 

I can see this value, although IPTM at present has no way of
using encapsulated traffic packets as probes.  I would have
to add an ITR-ETR protocol for that to be available - and
that would involve either adding a special header or trailer
to the encapsulated packet (so the ETR could uniquely
identify it), or devising some robust scheme by which the ETR
could tell the ITR exactly which packets it received, based
entirely on the naturally occurring characteristics of the
encapsulated packets.  Either option involves the ITR in
a bunch of record keeping and communications guff with the
ETR - and I am trying to keep this really lightweight.


>>> SHs that implement RFC4821 should not have a problem 
>>> deconflicting the (suspect) PTB information from (authentic) 
>>> end-to-end feedback from the DH, but should benefit from the PTB
>>> info when the actual data is not delivered to the DH.
>>
>> Yes, but I am assuming that none, or few, hosts will implement
>> RFC4821 any time soon.
> 
> Incrementally deployable; touches SH only; realizing
> larger MTUs gives incentive for deployment.
>  
>>> ITRs can help the situation by sending sprites of, e.g., 1500
>>> bytes into the tunnel early in the process so that most if not
>>> all SHs that use the tunnel will see a 1500 byte or larger MTU.
>>
>> Does "early in the process" mean when only shorter packets have so
>> far needed to be tunneled to the ETR?
>>
>> If so, then the ITRs could be generating large volumes (in bytes) of
>> probe packets in response to only small traffic flows, and to some
>> flows which never in fact require PMTU knowledge, since the flows
>> never actually use long packets.
> 
> There is nothing mandated here, and implementations will
> be evaluated on the merits of their probing strategies. 

I think an RFC for sprite-mtu or IPTM should give some guidance
on when to probe, but allow the ITR to make its own decisions
and/or be configured to suit local conditions.


>>>> Here I will assume IPv4 only, with 1280 bytes for the default
>>>> PMTU for every ETR the ITR has not yet probed.  I will also
>>>> assume an encapsulation overhead of 20, although this would
>>>> typically be higher for Sprite and non-Ivip ITR-ETR schemes.
>>> I don't understand "higher for sprite-mtu"?

>> This was a low-key aside.  In my second example, trying to explain
>> how I thought sprite-mtu might work, I kept the same 20 byte
>> encapsulation overhead I used in my first example, which was for
>> IPTM, assuming Ivip's 20 byte overhead.
>>
>> Other ITR-ETR schemes, and I guess most other tunneling schemes
>> sprite-mtu would be applied to, have higher encapsulation overheads,
>> I think.
> 
> The ENCAPS overhead is orthogonal to the use- or non-use
> of sprite-mtu.

I was just explaining why I carried 20 over as the ENCAPS value
from one example to the next.  However, I understand that
sprite-mtu does involve extra trailers on some traffic
packets.
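
To make the arithmetic in these examples concrete, here is a
minimal sketch in Python.  The function names are hypothetical
and are not taken from any draft; it assumes only the figures
used in my examples: a 20 byte encapsulation overhead and a
default ITR-to-ETR PMTU of 1280 until probing says otherwise.

```python
ENCAPS = 20          # encapsulation overhead assumed in the examples (bytes)
DEFAULT_PMTU = 1280  # ITR-to-ETR PMTU assumed before any probing (bytes)


def ptb_mtu(tunnel_pmtu, encaps=ENCAPS):
    """Longest inner packet that fits in the tunnel.

    This is the MTU value the ITR would report to the SH in a PTB.
    """
    return tunnel_pmtu - encaps


def needs_ptb(inner_len, tunnel_pmtu, encaps=ENCAPS):
    """True if an inner packet of inner_len bytes is too long to tunnel."""
    return inner_len > ptb_mtu(tunnel_pmtu, encaps)


# Before probing the SH would be told 1260; after the ITR
# decides the PMTU is 1480, the same calculation gives 1460.
before = ptb_mtu(DEFAULT_PMTU)
after = ptb_mtu(1480)
```

With a larger ENCAPS, as I expect for most other tunneling
schemes, both values shrink by the same amount.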


>>>> If the ITR sends a PTB message to the SH when the first packet
>>>> (or multiple packets) length exceeds the default PMTU value and
>>>> then, after probing, decides the PMTU is 1480, then I am
>>>> concerned that the SH would get contradictory values in these
>>>> PTB messages.
>>>>
>>>> At first the SH would be told to send packets no longer than
>>>> (1280 - 20 = 1260) and later, it would be told to send packets
>>>> no longer than (1480 - 20 = 1460).
>>>
>>> Note: in the next draft version I would like to rewrite the
>>> second bullet of Section 5.6.4 as:
>>>
>>> o  for IPv4/*/IPv4 tunnels, 'pathMTU' is less than MIN(EMTU_R,
>>>    1280+ENCAPS) bytes and the inner IPv4 packet is no larger than
>>>    MIN(EMTU_R-ENCAPS, 1280).
>>
>> The current version is:
>>
>>   o  for IPv4/*/IPv4 tunnels, 'pathMTU' is less than MIN(EMTU_R,
>>      1280) bytes and the inner IPv4 packet is no larger than
>>      MIN(EMTU_R, 1280) minus ENCAPS.  (When EMTU_R for the TFE is
>>      not known, 576 bytes must be assumed.)
>>
>> OK.  My eyes are glazing over right now.
> 
> Reaching that point also myself...

Indeed.  Is anyone else keeping up with this epic correspondence?


>>              SH breaks the message
>>              into smaller packets
>>              and retries:
>>
>>   1260       Send packet and          1260       The packet
>>              continue probing
>>
>>              ... etc.
>>
>>              Probing complete:
>>              PMTU to ETR decided
>>              to be 1460.
>>
>>> By probing, do you mean by the ITR or by the SH? 
>>
>> I meant the ITR sends sprite probes to the ETR.
> 
> Do you mean sprite-mtu probes and not traceroutes? One
> thing I do not understand about your proposal is how
> you expect the traceroutes to be efficient and converge
> within a reasonable amount of time? Also, there is no
> guarantee of getting the ICMPs back from the network
> middleboxes. Am I missing something?

This example is to explore my understanding of sprite-mtu.

Above, I mean that the ITR uses special probe packets to the
ETR, called "sprites".  I don't recall traceroute being a
part of your proposal.

In IPTM, the ITRs or ETRs don't do traceroute.  With Ivip's
"outer source address = sending host address" approach,
I propose that a modified traceroute program should be able
to trace all routers, including those in the tunnel, assuming
the ITR replicates the TTL value when encapsulating.  This
would involve moderate changes to the traceroute code to
recognise the ICMP packets which come back from the tunnel,
which are for the outer packet, not the inner as sent by
the traceroute program.  Such a traceroute program would be
able to depict which routers were in the tunnel, and also
determine the ETR address, from the ICMP messages coming
back from routers in the tunnel.  I have only looked at
this quickly, so perhaps there are problems with this
proposal.
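
As a rough illustration of the change to traceroute, here is a
minimal sketch in Python.  It assumes plain IP-in-IP
encapsulation (protocol number 4) and an ICMP Time Exceeded
message which quotes at least the 20 byte IPv4 header of the
packet that expired.  The function name and the returned
fields are hypothetical.

```python
import socket

IPPROTO_IPIP = 4  # assumption: the ITR uses plain IP-in-IP encapsulation


def classify_icmp_quote(quote):
    """Inspect the packet header quoted inside an ICMP Time Exceeded.

    `quote` is the data following the 8 byte ICMP header.  Returns
    None if it does not start with a complete IPv4 header, otherwise
    a dict describing the packet which expired.
    """
    if len(quote) < 20 or quote[0] >> 4 != 4:
        return None
    protocol = quote[9]                    # protocol field of quoted header
    src = socket.inet_ntoa(quote[12:16])   # with Ivip: the SH's own address
    dst = socket.inet_ntoa(quote[16:20])   # in the tunnel: the ETR's address
    in_tunnel = protocol == IPPROTO_IPIP
    return {
        "in_tunnel": in_tunnel,
        "src": src,
        "etr": dst if in_tunnel else None,
    }
```

A router outside the tunnel quotes the inner packet, so the
protocol field is whatever the traceroute program sent (UDP or
ICMP).  A router inside the tunnel quotes the outer header,
which is how the program could mark that hop as being in the
tunnel and read off the ETR address.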

Enabling traceroute from the SH, all the way through
the ITR-ETR tunnel, is not possible with other ITR-ETR
schemes unless the ITR performs heroics which I think are
prohibitively demanding.  Traceroute is a valuable
administrative and debugging tool and as far as I know
(someone confirmed this on-list) no applications rely on
traceroute.


>>> I am assuming that SHs will begin using RFC4821 and will
>>> probe the path for themselves independent of any probing
>>> done by the ITR.
>>
>> I am not assuming hosts will be any different than they are today.
>> As far as I know, few, if any, implement RFC4821.
> 
> If hosts see the value of larger MTUs, they will begin
> to deploy RFC4821. (Or, if vendors see the value, they
> will begin to push out RFC4821 in automated S/W updates.)
> IMHO, PMTUD cannot be efficiently and correctly handled
> within network middleboxes alone; end-to-end involvement
> is needed as well.

As noted above, I think it is a great idea, but it is
very complex and requires applications and OS to work
together.  So the OS framework and TCP stuff needs to be
done before an application could be made to work with it.

I have no idea to what extent this is being done, but it
sounds very complex and likely to happen slowly at best.


>> If you are assuming this, I think it would be good to make it an
>> explicit condition you are designing sprite-mtu to function within.
> 
> sprite-mtu works independently of RFC4821, but end systems
> benefit from using RFC4821.

But without RFC4821 in the SH, as far as I know, the SH
will be constrained for 10 minutes at least to the PMTU
value it gets in the first PTB message sent by the ITR.

This first value is lower than the one the ITR would send
after a few seconds of probing.

So I think an RFC4821 compliant SH is necessary for your
system to deliver communications with packet lengths
longer than the low, default, value the ITR must assume
at first.

If my understanding is correct, I think it would be good
to state this in some way in later versions of your ID.
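
Here is a minimal sketch in Python of the host behaviour I am
assuming.  It models a host without RFC4821 which follows the
RFC 1191 recommendation of waiting about 10 minutes after a
PTB before trying a larger packet.  The class name is
hypothetical, and resetting straight to the first-hop MTU when
the timer expires is a simplification.

```python
PMTU_RAISE_DELAY = 600.0  # seconds; the 10 minute timer RFC 1191 recommends


class ClassicPmtuEntry:
    """Per-destination PMTU estimate in a host without RFC4821."""

    def __init__(self, first_hop_mtu=1500):
        self.first_hop_mtu = first_hop_mtu
        self.pmtu = first_hop_mtu
        self.lowered_at = None  # time of the last PTB which lowered pmtu

    def on_ptb(self, reported_mtu, now):
        # A PTB can only lower the estimate, never raise it.
        if reported_mtu < self.pmtu:
            self.pmtu = reported_mtu
            self.lowered_at = now

    def current(self, now):
        # Only once the timer expires does the host try larger packets.
        if (self.lowered_at is not None
                and now - self.lowered_at >= PMTU_RAISE_DELAY):
            self.pmtu = self.first_hop_mtu
            self.lowered_at = None
        return self.pmtu
```

Once the SH has been told 1260 it sends nothing longer, so the
ITR has no occasion to send a second PTB with the better
value.  In this model the SH stays at 1260 until the timer
runs out, however quickly the ITR's probing finishes.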


>>   1260       Send outer packet -      1260       The packet
>>              the length is <= 1460.
>>
>>   1260       SH would probably keep              The packet -
>>              sending packets of                  but more and
>>              length <= 1260.  Unless             shorter
>>              the SH was pushy, it                packets than
>>              would never discover the            the ITR-ETR
>>              PMTU it could use was               tunnel can
>>              in fact 1440.                       handle.
>>
>>> IMHO, SHs that use RFC4821 can be "pushy" within reason.
>>
>> Yes, but I think it will be a long time before there are many such
>> hosts.
> 
> Why not?

I already discussed the complex software requirements for
implementing RFC4821 in hosts, including the establishment of
new paths of two-way communication between applications and
whatever part of the OS network code is responsible for
RFC4821.

I think that requires new standards in OS calls, and I guess
ideally some OS independent way of writing applications
which handles this stuff.


>>> Maybe I should add something about this to the spec?
>> Yes, I think the more explanation of where numbers like 1280 come
>> from, the better.
> 
> OK.

  Cheers

    - Robin


--
to unsubscribe send a message to rrg-request@psg.com with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg