
RE: [RRG] PMTUD, Sprite & IPTM; Outer src-addr = sending host's addr



Having been away from e-mail and just now going through
this detailed discussion, here are just a few related
thoughts:

- the ITR has no control over the pathMTU; all it can know
  a priori is the EMTU_R of each ETR. For IPv6, the minimum
  EMTU_R is 1500 bytes (see: RFC2460 and RFC4213) and for
  IPv4 the minimum EMTU_R is only 576 bytes (see: RFC1122).
  If there is a specification that says, for example, "all
  RFC(foo)-compliant ETRs MUST configure an EMTU_R of at least
  X bytes" (with X larger than the minimum), then the ITR can
  assume the larger size; otherwise, it can't.

- due to the fragment misassociation/reassembly issues
  mentioned by Iljitsch (and discussed in detail in RFC4963)
  excessive outer fragmentation for the encapsulated packets
  the ITR admits into the tunnel should be avoided, since the
  ETR will have to reassemble these. And again, the ITR cannot
  assume an EMTU_R for the ETR that is larger than the minimums
  listed above unless there is a spec that says that the ETR
  must configure the larger EMTU_R.

- the ITR might not be able to return a "packet too big"
  to the original source if it receives a PTB from a router
  inside of the tunnel, since the router might not return
  enough of the original packet in the ICMP message. Because
  of this, the ITR should send an explicit probe to the ETR
  in parallel with a data message that is larger than some
  previously probed size, since the ITR will get explicit
  feedback from the ETR if the probe succeeds.

- the worst case is for paths with fast links that configure
  small MTUs. In that case, the ITR and ETR need to engage in
  a soft state synchronization protocol to mitigate fragment
  misassociation/reassembly issues.
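
As a rough illustration of the constraints above (the EMTU_R floors
and the fragment-ID wrap discussed in RFC4963), here is a minimal
Python sketch. The function names, the spec_minimum parameter and the
link rate and timeout values in the usage lines are illustrative
assumptions, not taken from any specification.

```python
# Minimum EMTU_R values cited above (RFC1122 for IPv4, RFC2460 for IPv6).
MIN_EMTU_R = {4: 576, 6: 1500}


def assumed_emtu_r(ip_version, spec_minimum=None):
    """EMTU_R an ITR may assume for an ETR.

    spec_minimum is a hypothetical larger value mandated by some future
    ETR specification; without one, only the protocol minimum is safe.
    """
    floor = MIN_EMTU_R[ip_version]
    if spec_minimum is None:
        return floor
    return max(floor, spec_minimum)


def seconds_until_id_wrap(link_bps, packet_bytes, id_bits=16):
    """Time for a sender running at line rate to reuse an IPv4 ID value."""
    packets_per_second = link_bps / (packet_bytes * 8)
    return (2 ** id_bits) / packets_per_second


def max_packets_per_second(reassembly_timeout_s, id_bits=16):
    """Highest packet rate that never reuses an ID within the timeout."""
    return (2 ** id_bits) / reassembly_timeout_s


if __name__ == "__main__":
    # Assumed example: a 1 Gbps path carrying 1500-byte packets.
    print(seconds_until_id_wrap(1e9, 1500))
    # Assumed example timeouts: two minutes versus two seconds.
    print(max_packets_per_second(120.0), max_packets_per_second(2.0))
```

At 1 Gbps with 1500-byte packets the 16-bit ID space wraps in under a
second, which is far shorter than a reassembly timeout measured in
minutes, and smaller packets make the wrap faster still.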

Fred
fred.l.templin@boeing.com   

> -----Original Message-----
> From: Robin Whittle [mailto:rw@firstpr.com.au] 
> Sent: Sunday, November 25, 2007 5:49 AM
> To: Routing Research Group list
> Cc: Iljitsch van Beijnum
> Subject: Re: [RRG] PMTUD, Sprite & IPTM; Outer src-addr = 
> sending host's addr
> 
> Hi Iljitsch,
> 
> Thanks for your response and offer of the correspondence leading to
> Fred's proposal - I won't read that just now.
> 
> Hopefully some other folks can scrutinise Fred's proposal and mine:
> 
>   http://tools.ietf.org/html/draft-templin-inetmtu-06
>   http://www.firstpr.com.au/ip/ivip/pmtud-frag/
> 
> ITR-ETR schemes are fresh, and definitely need some kind of PMTUD
> system, so it should be fine to build in special functions at 
> both ends.
> 
> Thanks for your account of the fragmentation reassembly problems of
> IPv4.
> 
> > I don't have any references, but in short, the issue is that you
> > have a 16 bit ID space with a reassembly timeout of something
> > like a few minutes. This means you can only send 65536 packets
> > during that "few minute" window or you'll incorrectly reassemble
> > fragments from different packets if you lose a fragment. This is
> > especially problematic if the fragmented packets belong to a
> > tunnel because in that case the IP source/dest addresses are
> > always the same.
> 
> In a fresh system such as an ITR-ETR scheme, perhaps a workaround
> for this would be to set the maximum reassembly time at the ETR to
> something very much shorter, such as 1, 2 or 3 seconds?
> 
> That could be a pain where the ETR is an additional function in a
> server or router with a pre-existing TCP/IP stack.  Then it would
> probably not be possible to shorten the time just for the ITR to ETR
> packets.
> 
> 
> >>> It also costs you lots of CPU and could even allow for CPU 
> >>> exhaustion attacks.
> > 
> >> Yes, but I think it is better than dropping longish packets
> >> just because we assume some too low PMTU of 1280 or whatever,
> >> when in fact, within a second or two, the ITR will probably be
> >> able to establish that the real PMTU is 1500 or somewhat less.
> > 
> > Who said anything about preemptively dropping packets? Just send 
> > 1500-byte packets + an outer header with DF set and you'll get a
> > "too big". After that, you know the path MTU and you can in turn
> > send too bigs to the source of the original packets.
> 
> OK - here's the Devil's Advocate view:
> 
> Firstly, the ITR can't be sure it will get a Packet Too Big (PTB)
> message - that could be dropped.
> 
> Secondly, there could be another lower MTU limit beyond whatever
> sent the first PTB message.
> 
> Your algorithm, as I understand it, does not involve the ITR
> fragmenting any packets.  All the ITR can do when it gets the PTB
> message is send a PTB back to the sending host.  This really slows
> things down, since the sending host has to create another shorter
> packet and the whole process repeats itself.
> 
> With my IPTM approach, as long as the ITR doesn't yet have a
> reliable estimate of the PMTU to the ETR (which it takes a few
> seconds to establish with an explicit probe protocol, not using
> traffic packets, and not relying on PTB messages), it fragments any
> packet which is longer (after encapsulation) than the default PMTU,
> which I suggest be something like 1280 bytes.  This only persists
> for a few seconds while the probing takes place.  After that, the
> ITR has a good idea of the PMTU and then sends a PTB to the sending
> host when it next gets a packet which would exceed this, once
> encapsulated.
> 
> In my scheme, when the sending host does get a PTB message, it gets
> it nearly instantly from the (typically) nearby ITR, rather than
> from the ITR in your scheme with a long delay due to the long path
> the encapsulated traffic packet took from the ITR to some router in
> the tunnel and the delay of the PTB returning to the ITR.  Maybe you
> don't think much of this in Europe or the USA, but from Australia
> these delays are significant.
> 
> The round trip to your Netherlands based mailserver 83.149.65.1 is
> 162ms from my web server in San Francisco, and 343ms from a
> well-connected (ISP, not via DSL) server in Australia.
> 
> So my approach shouldn't involve significant delay in packets,
> whereas yours does significantly delay the first "long" packet.
> 
> If there is a second shorter MTU limit, the packet which hits that
> will be delayed similarly as well.  These packets would be lost if
> the PTB message didn't get back to the ITR.
> 
> Also, in Ivip, traffic packets from the ITR to the ETR are sent with
> sending host as the outer source address, not the ITR's address.
> (So any PTB message would go to the sending host, which would ignore
> it, since the destination address is of the ETR, not of the
> destination host.)
> 
> In Ivip, an ITR couldn't use traffic packets as probes for PMTU to
> the ETR - at least in a system which relied on PTB messages.  Using
> traffic packets as probes would require some special signalling to
> the ETR, in another packet or in a more complex header of the
> tunneled packet, to tell the ETR to send an acknowledgement to the
> ITR.  I think this complexifies the encapsulation system - so I use
> a completely separate probe protocol, between the ITR and the ETR,
> which has nothing to do with traffic packets.  I would rather add
> probe packets to the total system load than have the lost effort,
> delayed delivery and additional complications of using traffic
> packets as probes.
> 
> With synthetic probes, the ITR can push its luck as far as it likes
> with larger packets until one size doesn't generate
> acknowledgements.  This is a very robust way of determining PMTU,
> without relying on PTB packets - though of course the ITR would use
> any which it received.
> 
> If you can only use traffic packets to probe the PMTU, you are
> restricted to the lengths which result from their length and the
> encapsulation overhead.
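
A minimal sketch of the synthetic-probe idea described above, assuming
a hypothetical send_probe(size) callback that returns True when the
ETR acknowledges a probe of that size. The candidate sizes and the
callback are illustrative and not part of IPTM or any other
specification. Sizes are tried in ascending order and the search stops
at the first size that draws no acknowledgement.

```python
from typing import Callable, Iterable


def probe_pmtu(send_probe: Callable[[int], bool],
               candidates: Iterable[int],
               default_pmtu: int = 1280) -> int:
    """Return the largest candidate size the ETR acknowledged.

    send_probe is a hypothetical callback: it sends one probe of the
    given size to the ETR and reports whether an acknowledgement came
    back before some timeout chosen by the caller.
    """
    best = default_pmtu
    for size in sorted(candidates):
        if size <= best:
            continue  # already known to fit
        if not send_probe(size):
            break  # no acknowledgement: treat this size as too big
        best = size
    return best


if __name__ == "__main__":
    # Simulated path whose real PMTU is 1400 bytes (assumed value).
    def fake_path(size: int) -> bool:
        return size <= 1400

    print(probe_pmtu(fake_path, [1300, 1400, 1450, 1500]))
```

A single lost probe would be read here as "too big", so a real
implementation would presumably retry each size a few times before
giving up on it.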
> 
> 
> My approach is:
> 
>    While PMTU to the ETR is being established:
> 
>       Fragment traffic packets which (after encapsulation) are
>       longer than the default PMTU - even if the DF flag is set,
>       since this is only within the tunnel.  This gets the data
>       through, in most cases.  Reassembly is in the ETR, so the
>       destination host gets one packet.
> 
>    Once this has occurred, probe the PMTU to the ETR with a specific
>    protocol and artificial probe packets - not with traffic packets.
> 
>    Then the ITR can send a reliable PTB message to the source, in
>    the likely event that it gets another packet which is too long
>    to fit in the tunnel, once encapsulated.
> 
>    The result is that all packets are delivered, except the first
>    one for which the ITR sends back a PTB message.  Then the sending
>    host has been reliably informed of the entire PMTU situation and
>    adjusts subsequent packets accordingly.  There is no recursing
>    of your algorithm, with delayed delivery in each instance, as
>    would be required if there was first a 1450 MTU limit and then
>    further along the tunnel a 1400 limit (unrealistic examples).
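
The per-packet decision in the approach above can be sketched roughly
as follows. This is only an illustration under stated assumptions: the
TunnelState class, the action names and the 20-byte encapsulation
overhead in the usage lines are hypothetical, and the 1280-byte default
is the value suggested in the message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DEFAULT_PMTU = 1280  # default suggested in the message, used until probing ends


@dataclass
class TunnelState:
    encap_overhead: int                # bytes added by the outer header
    probed_pmtu: Optional[int] = None  # None while probing is in progress


def handle_packet(state: TunnelState, inner_len: int) -> Tuple[str, int]:
    """Decide what the ITR does with one traffic packet.

    Returns (action, value):
      ("forward", outer_len)   encapsulate and send unfragmented
      ("fragment", limit)      encapsulate, then fragment to fit limit
      ("send_ptb", inner_mtu)  drop and report inner_mtu to the sender
    """
    outer_len = inner_len + state.encap_overhead
    if state.probed_pmtu is None:
        # PMTU not yet established: fragment inside the tunnel, even if
        # the inner packet has DF set, so that the data still gets through.
        if outer_len > DEFAULT_PMTU:
            return ("fragment", DEFAULT_PMTU)
        return ("forward", outer_len)
    if outer_len > state.probed_pmtu:
        # PMTU known: tell the sending host the largest inner packet
        # that fits once the encapsulation overhead is added.
        return ("send_ptb", state.probed_pmtu - state.encap_overhead)
    return ("forward", outer_len)


if __name__ == "__main__":
    state = TunnelState(encap_overhead=20)  # assumed IPv4-in-IPv4 overhead
    print(handle_packet(state, 1500))       # still probing
    state.probed_pmtu = 1400                # assumed probe result
    print(handle_packet(state, 1500))
```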
> 
> If the ETR simply forwards packets to the destination host, and
> there are PMTU limits between the ETR and the destination host,
> these are dealt with by the sending host, not the ETR or ITR.  If
> the ETR tunnels to the destination host, then that tunnel mechanism
> needs to handle the MTU problems - such as by using Sprite.
> 
> The task of my scheme is purely related to the tunnel between one
> ITR and one ETR.  The same ETR could have multiple destination
> hosts, each with their own MTU limits.
> 
> 
> > Yes, this allows for PMTUD black holes, but those are subject to
> > the "so don't do that and the problem goes away" doctrine. ISPs
> > generally get this, unlike enterprise people and ignorant
> > consumers who can't live without their firewalls.
> 
> I think an ITR-ETR scheme should minimise the degradation of
> communication as much as possible.  I don't think we should let the
> ITR-ETR scheme frequently drop packets - otherwise end-users will
> quickly discover that this new class of address space sucks.
> 
> 
> >> Still, we can predict that there will be such large packets
> >> early on in many communications.  Simply dropping them doesn't
> >> seem right to me.  Dropping them with a too-low PMTU value
> >> being sent to the sending host would screw up that host's later
> >> packets, making them shorter than they really need to be.  I
> >> think fragmenting them at first is the best approach.
> > 
> > If we mandate that *TRs support 1500-byte user traffic without 
> > fragmentation this wouldn't be any issue in practice.
> 
> But then we wouldn't be able to put ITRs and ETRs in all the places
> they need to be.  The whole system would be much more restricted in
> scope and would not be deployed as widely or as flexibly.
> 
> 
> >>>> Later, if more such packets need to be sent, the ITR and
> >>>> ETR can work on determining the real PMTU.  I do this with
> >>>> probe packets, rather than traffic packets.
> > 
> >>> Even more overhead...
> > 
> >> Yes.  However I can't see a way of probing the PMTU in any
> >> other way.  ICMP can't be relied upon, and if I tried to use
> >> only traffic packets, I would have to risk those packets not
> >> arriving.  Instead, IPTM fragments the traffic packets and
> >> sends its own probe packets. This means there is no fancy
> >> overhead in traffic packets - they are not intended to be used
> >> for PMTUD at all.
> > 
> > I REALLY don't like this: generating signalling traffic when
> > there is no data traffic is a very bad precedent. However, we
> > probably need to probe for reachability in some way or another,
> > if we can do the MTU stuff along with that it may be tolerable.
> 
> In my proposal, and I think in Fred's, there are no probe packets
> unless the ITR perceives that there are likely to be further longer
> packets to handle.  So yes, there would be one or more large probe
> packets.  ITRs might evolve some smarts about what length probe
> packets to try, perhaps based on comparing notes with nearby ITRs
> who have tunneled to this ETR recently, or by some downloaded
> "cheatsheet" suggesting good lengths to try, for each of the BGP
> prefixes in which ETRs have been found.
> 
> Also, an ITR might be happy with getting a response from an ETR to a
> relatively long probe packet and leave it at that - rather than send
> another somewhat longer packet, and wait a few seconds before
> deciding that was too long.  That would lead to longer delays before
> it could inform the sending host with a PTB message, prolonging the
> time in which the ITR needs to fragment longer traffic packets.
> 
> This might lead to a generalised minor underestimate for PMTUs, in
> addition to the overhead imposed by tunneling.  However, that might
> be preferable to spending a few seconds pushing larger and larger
> probe packets at the ETR until one size is perceived as never
> generating a response.
> 
> Some of these matters are also discussed in the other thread "MTU,
> jumboframes, ITR & ETR placement, ITR function in hosts" - please
> see my next message.
> 
>  - Robin
> 
> 
> --
> to unsubscribe send a message to rrg-request@psg.com with the
> word 'unsubscribe' in a single line as the message text body.
> archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg
> 
