
Re: [RRG] Perplexing PMTUD and packet length observations



Hi Iljitsch,

You wrote:

> Ok, so this is not the same subnet, right? Note that if you feed
> tcpdump a few -v's you don't have to do as much header decoding in
> your head.

Please take a look at my page, which documents the situation
clearly, complete with the commands I give tcpdump:

http://www.firstpr.com.au/ip/ivip/ipv4-bits/actual-packets.html#jumbo1


> What it looks like to me is that you're actually tcpdumping rather
> than ipdumping: what you see is an initial two segment
> transmission but as a single packet. Could it be that you're
> tcpdumping some virtual interface rather than real packets on the
> wire?

A complete packet dump, with lengths easily visible in a different
colour, is:

  http://www.firstpr.com.au/ip/ivip/ipv4-bits/jumbo-mss.ht << ml

     (Type in the "ml" at the end manually, I don't want links
      to this big file.)

This is the full hex dump of the packets - the TCP handshake, the
HTTP request, the data packets and the ACKs.  They all have full TCP
headers, checksums etc.

If you search for 7260 you will see the longest packet - carrying 5
times the MSS amount of data.  If you look at its Identification
field (3rd block of hex) it is 3d47.  In the next outgoing packet,
the Identification field is 5 more: 3d4c.
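
As a quick sanity check on that arithmetic - a throwaway Python
snippet, nothing more, using the two IDs quoted above:

    # The kernel appears to consume one Identification value per
    # MSS-sized segment, so a packet carrying 5 segments advances
    # the counter by 5.
    assert 0x3d47 + 5 == 0x3d4c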

> Can you capture ethernet headers?

Yes.  Just add 't' to this URL:

http://www.firstpr.com.au/ip/ivip/ipv4-bits/with-ethernet-headers.tx

I just did this - it is a capture of a similar packet transfer to a
client here at home.  This shows an ACK packet and in response, just
13 microseconds later, a jumboframe going out.

    07:15:57.497378 IP 72.36.140.10.80 > 150.101.162.123.3941: .
                    18357:25417(7060) ack 634 win 6963
	0x0000:  0012 807c 117f 0015 609b 0c04 0800 4500
	0x0010:  1bbc 244c 4000 4006 ede0 4824 8c0a 9665
        ...
	0x1ba0:  af6b 6664 0dcc 168e a084 dc28 a81b efb6
	0x1bb0:  26ac 8aeb 4149 413c dba6 f5e8 0e1a 5d2d
	0x1bc0:  bc10 7fa0 5d33 0310 695c

The IPv4 packet is:

         Length = 7100 bytes
         ||||
    4500 1bbc 244c 4000 4006 ede0 4824 8c0a
                   |
                   DF=1

The TCP segment size is 7060 - exactly 5 times the lower of the two
MSS values - 1412.
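
To double-check that decoding, here is a quick Python sketch.  The
header bytes are copied straight from the dump above, and I am
assuming a 20 byte TCP header (no options on the data packets):

    import struct

    # First 12 bytes of the IPv4 header, from the 0x0000/0x0010 lines
    # above (with the 14 byte ethernet header stripped off).
    hdr = bytes.fromhex("45001bbc244c40004006ede0")

    _, _, total_len, ident, flags_frag = struct.unpack("!BBHHH", hdr[:8])
    print(total_len)                  # 7100
    print("0x%04x" % ident)           # 0x244c
    print(bool(flags_frag & 0x4000))  # True - DF=1
    print(total_len - 20 - 20)        # 7060 = 5 x 1412 of TCP payload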

This section:

http://www.firstpr.com.au/ip/ivip/ipv4-bits/actual-packets.html#2008-08-12

shows the timing of outgoing packets with respect to the ACKs which
seem to prompt them.  Two of the packets are 8512 bytes - a TCP
segment of 6 x 1412 = 8472, plus 40 bytes of IP and TCP headers.

> Maybe some work is being offloaded to the NIC?

That wouldn't fit with the complete IP header for the whole
jumboframe, or the ethernet packet dump.  I am sure this is a true
record of the physical packet leaving the machine.

> If not, I'd say that all of this is a bug in the linux networking
> code (which is weird to begin with)

I can't imagine it is a bug.  It is conceivable it is http doing
this, but the short turnaround time between each ACK arriving and
the large packet being sent out makes me think there is something in
the kernel which is bunching together the contents of TCP packets
created by httpd, and then having a go at firing them out to the
Net, with DF=1 - presumably being able to fire the same stuff out in
smaller, or even normal sized, TCP packets if it gets a PTB.

> but I have no explanation about why you would be seeing normal
> size packets without fragmentation. I'm pretty sure ISPs wouldn't
> want to expend CPU cycles to do this on behalf of their hosted
> customers...

I have no explanation for either of these things - the server
bundling together TCP data in flagrant violation of the RFCs as I
understand them, and (as best I can guess) the PPPoE router taking
it upon itself to recreate the individual RFC conformant TCP packets.


> (BTW, I thought having a server 1600 km away was impressive...)

It's 14,500 km or so (9,000 miles) from Melbourne to Dallas Fort
Worth.  Editing text files on the server involves my keystroke going
through a planet thickness of quartz to the server, and the new
character coming back the same way, all in about 230ms.


>> I couldn't easily find the thread at:
>
>>  http://www.merit.edu/mail.archives/nanog/
>
> Look for "microsoft".

It's not in the subject lines, and there is no search facility.

>>> ??? Why would advertising a large MSS be a problem? You send
>>> what the other advertises he/she can handle and obviously _they_
>>> will be sending you what they can handle.
>>
>> Yes, but what if, for some reason, there is a router in the path
>> with a smaller MTU than is generally seen by the client or by
>> Google?
>
> MSS is end-to-end, you still need PMTUD or fragmentation.

Yes - and with Google sending out large packets with DF=0, it is
expecting any hapless router in the middle, with a lower next hop
MTU than this length, to do a lot of work without complaint.
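
To make concrete how much work that is, here is a rough sketch in
Python (the function name and the simplifications are mine) of just
the length arithmetic a router must do for every oversized DF=0
packet - on top of building a fresh header, recomputing the checksum
and copying the right options into each piece:

    def fragment_lengths(total_len, mtu, ihl=20):
        # Each fragment except the last must carry a multiple of 8
        # payload bytes, since the Fragment Offset field counts in
        # 8-byte units.
        per_frag = (mtu - ihl) // 8 * 8
        payload = total_len - ihl
        frags = []
        while payload > per_frag:
            frags.append(ihl + per_frag)
            payload -= per_frag
        frags.append(ihl + payload)
        return frags

    # A 7100 byte packet hitting a 1500 MTU next hop:
    print(fragment_lengths(7100, 1500))
    # [1500, 1500, 1500, 1500, 1180]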


>> I think there is no workable alternative to RFC 1191 PMTUD.
>
> What they should have done back then was create a mechanism that
> allows the receiver of fragments to tell the sender that the
> packet was fragmented and what the size of the largest fragment
> was.
>
> This would have been harder to deploy (changes on both ends) but
> more robust.

OK - so packets would not be dropped, just chopped into two or more
fragments, which themselves might be fragmented too.

Then you rely on the destination host to tell the sender about the
fragmentation, rather than the router which fragments.

I think this is a lot of work for the router.

Better to drop the too-big packet and send a PTB.  That would be
faster and so bring forward the time when the sending host creates
packets which are suitable for the whole path.  RFC 1191 has the
router send the exact MTU, which is better than what I think you are
suggesting, since the destination host wouldn't be able to tell what
the MTU limit was which caused the fragmentation.  That would leave
the sending host no option but trial and error - sending further,
slightly shorter packets which would have a high chance of being
fragmented too.

I think RFC 1191 is a better approach than what I understand of your
suggestion.


>> RFC 4821 is so difficult to implement
>
> Indeed. Still, it probably has to be done at some point,
> especially if we ever want to move away from 1500 as the
> internet's maximum packet size.

We should all stop burning fossil fuel at some point too.

I don't see the widespread use of 1500 byte MTU gear as being
incompatible with RFC 1191.

It would be nice if the sending host had a better clue about the
outside world than the simple fact that its Ethernet link has an MTU
of 9k or so.  This is pretty dumb, and it might involve each session
sending a 9k packet towards the destination host, where some poor
1500 MTU next hop router goes "Not again . . . " and sends back a
PTB for the millionth time.

As long as most of the PTBs get back to the RFC 1191 compliant
sending host, I think it will work fine.

However, designing a map-encap system which does not completely
disrupt this in terms of MTU limits between the ITR and ETR is very
challenging.

>>> The first mistake was to invent the DF bit in the first place.
>
>> I guess you mean that all packets should always have been
>> non-fragmentable and that something like RFC 1191 should always
>> have  been in existence.
>
> No: if you have fragmentation anyway, there is no reason to have a
> source say it can't be done. It would arguably be useful for the
> destination to say that, but this isn't what DF does so before RFC
> 1191 came along it was useless.

I can't understand why the PTB message was first defined without
also including the MTU value which the packet's length exceeded.

It seems like such a no-brainer - and the RFC states that the MTU
Discovery Working Group spent months reinventing what "was first
suggested by Geof Cooper, who in two short paragraphs set out all
the basic ideas".
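
The fix, when it finally came, was tiny.  A sketch of the RFC 1191
message layout in Python (function name mine; checksum left as zero
to keep it short):

    import struct

    def build_ptb(next_hop_mtu, dropped_packet):
        # ICMP Destination Unreachable (type 3), code 4:
        # "fragmentation needed and DF set".  RFC 1191 puts the
        # Next-Hop MTU in the low 16 bits of the second word, which
        # RFC 792 had simply left unused.
        icmp = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu)
        # Both RFCs echo the dropped packet's IP header plus its
        # first 8 data bytes (assuming a 20 byte IP header here).
        return icmp + dropped_packet[:28]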


>>> The second mistake is to suggest that the DF bit be set for ALL
>>> packets to do PMTUD in RFC 1191.
>>
>> I don't understand your objection.
>
> Set it only for 10% of your packets and you still have
> connectivity when there is a black hole and the PMTUD works just
> fine.

OK - so routers would fragment 90% of the packets and the PTB only
goes back when one of the 10% of packets has its DF flag set?

That just seems to slow down the sending host's response to the MTU
situation.

I still like RFC 1191 better.  There's no fragmentation and the
sending host gets the fastest possible feedback that it needs to
send smaller packets.

That fragmented situation is less reliable than when the packets
sail straight through, so I think it is a good thing to get the sending
host properly adapted to the PMTU ASAP.

>> Removing fragmentation from the network is a really good aspect
>> of IPv6, I think.  Ideally, I think, all packets should be sent
>> DF=1 and all applications should be ready to cope
>
> No. This is a layer 3 job, not a layer 7 job.

I don't understand this.

> An interesting approach would be to simply truncate packets that
> are too big rather than fragment or drop them. A difference
> between the IP length field and the actual length of the packet
> indicates truncation.
>
> Transports would have to be changed to semi-ACK truncated data so
> the sender only retransmits a checksum over the semi-ACKed data
> after which a full ACK/NAK is possible.

I don't clearly understand this either - but it sounds messy.

> The semi-ACK also implicitly signals the maximum path MTU.

Yes, this would have the advantage that if there were a series of
narrower bottlenecks - 1400, 1300, 1170 - then in a single
round-trip time the sending host would know the full PMTU to the
destination host.  With RFC 1191, this would take three round trip
times.  Unfortunately, the truncated packet which survives isn't
much use to the destination host, so it still takes 1.5 RTTs to get
the data there.  Still, that is better than 3.5 RTTs with RFC 1191.
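
A back-of-envelope check of those numbers in Python (the 9000 byte
starting MTU and the half-RTT accounting are my assumptions):

    bottlenecks = [1400, 1300, 1170]   # narrowing MTUs along the path

    # RFC 1191: each bottleneck drops the packet and returns a PTB,
    # costing a full round trip per bottleneck before data can arrive.
    mtu, rtts = 9000, 0.0
    for hop_mtu in bottlenecks:
        if mtu > hop_mtu:
            mtu, rtts = hop_mtu, rtts + 1.0
    rtts += 0.5                        # the packet that finally arrives
    print(mtu, rtts)                   # 1170 3.5

    # Truncation + semi-ACK: the truncated packet reaches the
    # destination at 0.5 RTT, the semi-ACK (carrying the 1170 limit)
    # returns at 1.0, and a correctly sized resend lands at 1.5 RTTs.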


>> The only reasonable solution seems to be send all packets DF=0
>> and expect all routers to report PMTU troubles with a PTB
>> message.
>
> You mean DF=1?

Oops - Yes.

> DF=0 is not a solution for IPv6...
>
>> Networks which block PTB packets are doing themselves and anyone
>> who connects to them a grave disservice.
>
> Yes, but they've been getting away with it so far because
> "everyone" supports a 1500-byte MTU. So now breaking _that_
> assumption creates problems.

My understanding of this is that if all hosts have a next hop MTU of
1500, and the core has an MTU of 9000, then it is no problem if the
destination network blocks PTBs from leaving that network since no
host would be sending packets bigger than 1500 anyway.

But plenty of servers - probably most by now - have gigabit ethernet
and so have a real PMTU for most of the core, and into quite a few
edge networks, of 9k or so.

When they send a packet to some edge network with 1500 MTU links,
which blocks the PTBs which should go back to the sending host, then
there is a black hole.  It gets messier with various hosts having
various next hop and nearby MTU limits in the ~1460 to 1500 range,
various host settings and various ICMP-blocking destination networks
with their own ~1460 to 1500 MTU limits.

I guess the majority of websites can now send jumboframes, like my
server can.  But my server offers an MSS of 1460 for a packet size
of 1500.  Then, if an ICMP-filtering edge network has a 1500 MTU,
there is no problem.  I am not sure where the MSS is configured.
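
The 1460 figure is just the usual ethernet arithmetic - assuming no
IP or TCP options - so presumably it is derived from the interface
MTU rather than set anywhere explicitly:

    mtu = 1500
    mss = mtu - 20 - 20   # minus the IPv4 header, minus the TCP header
    print(mss)            # 1460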

But for reasons unknown, my server is trying its luck with DF=1
jumboframes way longer than 1500 or the MSS from the client.

I would think that this strategy would come unstuck with an
ICMP-filtering edge network with a 1500 MTU - unless this
"TCP bundling" facility also disabled itself in the absence of ACKs.

If this mysterious system did come unstuck due to networks blocking
PTBs, then I would get complaints that people couldn't access some
things (larger files) on my website - but there are no such complaints.


>>> I'm not sure if implicitly making IPv6 packets unfragmentable
>>> was a mistake, but relying on ICMP messages was.
>
>> Do you suggest some other kind of message, or do you think PMTUD
>> should be done on the basis of positive acknowledgements alone,
>> with silent discarding of a too-big packet at whichever router
>> can't handle it?
>
> With IPv6, it would have been possible to come up with a
> truncation approach or maybe something where routers write a
> maximum packet size in certain packets.
>
> But now the only way forward is RFC 4821 etc while working hard to
> fix PMTUD black holes until 4821 is widely implemented.
>
>> Google:   No results found for "RFC 4821 deployment".
>
> Yeah, none for "RFC 791 deployment" either...

Touché - but where is the evidence of applications and operating
systems actually implementing RFC 4821?  Is there any site, any
working group or whatever where this is discussed?

 - Robin
