
Re: [RRG] Perplexing PMTUD and packet length observations Oops: TSO



Hi Iljitsch,

In a previous message, regarding my server apparently sending out
TCP packets which were longer than the MTU, you wrote:

> Maybe some work is being offloaded to the NIC?

I should have checked this more carefully - you are right.

The Ethernet chip is a Broadcom BCM5751, which does "Large Send
Offload" AKA "task offload", "segmentation offload" or "stack offload":

  http://www.broadcom.com/collateral/pb/5751-PB03-R.pdf
  http://www.microsoft.com/whdc/device/network/taskoffload.mspx

tcpdump sees the output of the Ethernet driver, and the chip then
breaks up the long packets into ordinary-length TCP packets.
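
As a rough illustration of what the chip does with one of these long
packets, here is a minimal Python sketch.  It is not the BCM5751's
actual logic: the function name tso_split is hypothetical, and the MSS
of 1460 assumes a 1500 byte MTU with 20 byte IPv4 and TCP headers and
no options.  It only shows the payload being cut into MSS-sized pieces,
each with its own TCP sequence number.

```python
def tso_split(payload: bytes, first_seq: int, mss: int):
    """Split one oversized TCP payload into (sequence_number, chunk) pairs."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # TCP sequence numbers count payload bytes and wrap at 2**32.
        seq = (first_seq + offset) % 2**32
        segments.append((seq, chunk))
    return segments


# Hypothetical example: a 4000 byte "super packet" handed to the NIC.
super_packet = bytes(4000)
segments = tso_split(super_packet, first_seq=1000, mss=1460)
for seq, chunk in segments:
    print(seq, len(chunk))
```

tcpdump, sitting above the driver, would see the single 4000 byte
payload; the wire would carry the smaller pieces.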

Sorry about the false alarm!

This page discusses these hardware based techniques:

  http://kb.pert.geant2.net/PERTKB/LargeSendOffloadLSO

     (Transport) Protocol Fossilization

      The way it is defined by most of the industry, LSO
      needs to be aware of the transport protocols.
      In particular, it must be able to split over-large
      transport segments into suitable sub-segments, and
      generate transport (e.g. TCP) headers for these
      sub-segments. This function is typically implemented
      in the adapter's firmware, for some popular
      transport protocol such as TCP. This makes it hard
      to implement additional functions such as IPSec, or
      the TCP MD5 Authentication option, or even other
      transport protocols such as SCTP.

      There is a weakened form of LSO that requires the
      host operating system to prepare the segmentation
      and construct headers. This allows for "dumber"
      network adapters, and in particular it doesn't
      require them to be transport protocol-aware.  It
      still provides significant performance improvement
      because multiple segments can be transferred between
      host and adapter in a single transaction, which
      reduces bus occupation and other overhead.

      Sun's Solaris operating system supports this variant
      of LSO under the name of "MDT" (Multidata Transmit),
      and the Linux kernel added something similar as part
      of "GSO" in 2.6.18.

This has turned up something potentially relevant to scalable
routing - the use of hardware to generate the final packets sent to
destination hosts.  This is done today because it is marginally more
efficient in terms of CPU load when the MTU is around 1500 bytes.
I am not sure how much difference it would make if the artificially
large TCP "super packet" was up to 64k long, and the NIC split it
into 9k packets.
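
To put rough numbers on that comparison, the sketch below counts how
many wire packets one 64k super packet becomes at each MTU.  The
assumptions are mine: 64k is taken as 65536 bytes of TCP payload, and
each packet carries 40 bytes of IPv4 and TCP headers with no options.

```python
import math

# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
IP_TCP_HEADERS = 40


def segments_needed(payload_bytes: int, mtu: int) -> int:
    """Number of MTU-sized packets needed to carry payload_bytes."""
    mss = mtu - IP_TCP_HEADERS
    if mss <= 0:
        raise ValueError("MTU too small to carry any payload")
    return math.ceil(payload_bytes / mss)


for mtu in (1500, 9000):
    print(mtu, segments_needed(64 * 1024, mtu))
```

Under these assumptions the host-to-NIC transaction replaces 45
packets at a 1500 byte MTU but only 8 at 9000, so the per-packet
saving from offload would presumably be smaller with jumboframes.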

GSO (Generic Segmentation Offload) does the splitting into smaller
packets in software.

  http://www.linuxfoundation.org/en/Net:GSO
= http://lwn.net/Articles/189970/ Herbert Xu (2006-06-20)

"ethtool eth0" shows the NIC is running at 100Mbps, although it is a
1Gbps device.  A friend told me Internet servers are usually
connected at 100Mbps.  ethtool can be used to turn on and off TCP
Segmentation Offloading (TSO ~ Large Send Offload), UFO (UDP
Fragmentation Offload) and GSO.  The settings were:

    rx-checksumming: on
    tx-checksumming: on
    scatter-gather: on
    tcp segmentation offload: on     <<<  TSO ~= LSO
    udp fragmentation offload: off
    generic segmentation offload: off

I gave the command: "ethtool -K eth0 tso off" and there were no more
of these long packets reported by tcpdump.
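
For anyone wanting to check this on several machines, here is a small
Python sketch that reads a settings listing like the one above (it
looks like the output of "ethtool -k eth0", though that is my
assumption) and reports which offloads are on.  The function name
parse_offload_settings is hypothetical, and the sample text is simply
the listing from this message; other ethtool versions may label the
features differently.

```python
def parse_offload_settings(text: str) -> dict:
    """Map each 'name: on|off' line to True (on) or False (off)."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, not a setting line
        words = value.split()
        # Only the first word matters; trailing notes are ignored.
        if words and words[0] in ("on", "off"):
            settings[name.strip()] = (words[0] == "on")
    return settings


sample = """\
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
"""

settings = parse_offload_settings(sample)
if settings.get("tcp segmentation offload"):
    print("TSO is on: tcpdump may show packets longer than the MTU")
```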


Thanks for the link to the long discussion about Microsoft's network
ignoring PTB packets in May this year:

> Hm, it wasn't so recent apparently, and ack on the quality of the archive:
> 
> http://readlist.com/lists/trapdoor.merit.edu/nanog/7/35484.html


>>> MSS is end-to-end, you still need PMTUD or fragmentation.
> 
>> Yes - and with Google sending out large packets with DF=0, it is
>> expecting any hapless router in the middle, with a lower next hop
>> MTU than this length, to do a lot of work without complaint.
> 
> Such are the perils of implementing RFC 791.

Yes, but fragmentation in the network was not included in IPv6 and I
am keen not to have it in ITRs, ETRs or in the path between them.

. . .

> Apparently there's a disconnect between spec writers and implementers on
> the one hand and people who have to debug connectivity problems on the
> other hand, with the clueless firewall admins living in a bubble
> disconnected from everything.
> 
>> I still like RFC 1191 better.  There's no fragmentation and the
>> sending host gets the fastest possible feedback that it needs to
>> send smaller packets.
> 
> I'll take reliable over fast.

I think RFC 1191 is reliable if PTBs are not filtered.  It is not a
serious problem if the odd PTB is lost due to congestion.
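
A toy model of that claim, in Python.  It is only a sketch under
simplifying assumptions of mine: the path is a list of link MTUs in
order, every packet has DF=1, a router facing a too-small next hop
drops the packet and returns a PTB carrying that next hop's MTU (as
RFC 1191 specifies), and filtering is all-or-nothing.  The function
name discover_pmtu is hypothetical.

```python
def discover_pmtu(link_mtus, first_hop_mtu, ptb_filtered=False, max_tries=10):
    """Return the discovered path MTU, or None if the sender is black-holed."""
    size = first_hop_mtu
    for _ in range(max_tries):
        # First link along the path that cannot carry a packet of this size.
        bottleneck = next((m for m in link_mtus if m < size), None)
        if bottleneck is None:
            return size      # the packet fits on every link
        if ptb_filtered:
            return None      # packet dropped and the PTB never arrives
        size = bottleneck    # the PTB reports the next-hop MTU; retry smaller
    return None


path = [9000, 4470, 1500]  # hypothetical path
print(discover_pmtu(path, 9000))
print(discover_pmtu(path, 9000, ptb_filtered=True))
```

With PTBs delivered, the sender converges on 1500 after one PTB per
smaller link.  With PTBs filtered, it never learns anything and keeps
sending packets which are silently dropped.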

. . .

>> Unfortunately, the surviving packet fragment isn't much use
>> to the destination host, so it still takes 1.5 RTTs to get the data
>> there.  Still, that is better than 3.5 RTTs with RFC 1191.
> 
> You can still use the data, except that you can't check its integrity
> because the checksum is now incorrect. 

But you just wrote:

> I'll take reliable over fast.

!

> So the semi-ACK asks the other side to send just the checksum over 
> the data that was correctly received.

This sounds unreliable and slow.


>> My understanding of this is that if all hosts have a next hop MTU of
>> 1500, and the core has an MTU of 9000, then it is no problem if the
>> destination network blocks PTBs from leaving that network since no
>> host would be sending packets bigger than 1500 anyway.
> 
> Your premise is invalid so the conclusion is meaningless.

I was describing an artificial situation for the sake of discussion.


> Actually, it would be interesting to do some research into the MTU
> distribution across the internet.

Indeed.

>> But plenty of servers - probably most by now - have gigabit ethernet
>> and so have a real PMTU for most of the core, and into quite a few
>> edge networks, of 9k or so.

A friend told me today that most servers are connected with 100Mbps
links, even if they have gigabit NICs.  Part of the reason is to
reduce the burstiness of each server.

The trick would be to find a host with 1G links all the way to
several border routers, which themselves have 1G links to . . .


> Note that although the 9000-byte jumboframe capability is common, there
> are also very many implementations that use different sizes so it's
> impossible to standardize on anything, even if you could ignore the fact
> that the current internet expects 1500.
> 
> Also, because of the 802.3 spec, the jumboframe capability must be
> enabled administratively, and because of the IP-over-ethernet specs, all
> hosts on a subnet must use the same MTU, so basically deployment is
> impossible. 

Yes - I understand that a single 1500 byte MTU device on an Ethernet
switch (such as a 100Mbps NIC, or perhaps a 1Gbps NIC running with a
1500 byte MTU) forces all other devices to use 1500.

> (This is what my draft addresses.)

  http://tools.ietf.org/html/draft-van-beijnum-multi-mtu-02

OK - for IPv6 and necessarily with host and router changes.


>> When they send a packet to some edge network with 1500 MTU links,
>> which blocks the PTBs which should go back to the sending host, then
>> there is a black hole.
> 
> You mean: a network that doesn't generate them in the outgoing direction?
> 
> In practice this won't be a problem because few people will connect to
> the internet with a 1500+ MTU and then not generate too bigs. Since
> routers generate them out of the box and ISPs usually don't have
> firewalls in the middle of their networks and don't like support calls,
> ISPs tend to generate them.
> 
> If you use an MTU bigger than the standard 1500, then you shoot yourself
> in the foot with ICMP filtering so you're not likely to do both. The
> trouble is mainly with using a smaller MTU: then the problem is caused
> by _other_ people not listening to _your_ too bigs and there is little
> that you can do.

OK - I read the first message in the NANOG "Microsoft.com PMTUD
black hole?" thread:

  http://readlist.com/lists/trapdoor.merit.edu/nanog/7/35484.html

which involves Microsoft servers in a whole /16 ignoring PTB
messages sent by routers in other networks.


>> I guess the majority of websites now can send jumboframes, like my
>> server can.

Apparently not, if most are connected to 100Mbps Ethernet switches.
I guess it would be possible to use a 1Gbps NIC and switch, somehow
throttle the speed to something lower, and still keep the usual ~9k
MTU of 1Gbps Ethernet.


> That doesn't mean that all the stuff in the middle can handle
> jumboframes. The core of the network generally can, but the stuff around
> the edges, like the cheap switches that connect dozens of servers like
> yours, are likely to only support small packet sizes, either 1500 or
> "mini jumbos" of 1500 - 2000 bytes.

OK - just as my friend told me.

Thanks for pursuing this discussion.  Sorry about the false alarm
with these too-long packets.

  - Robin


--
to unsubscribe send a message to rrg-request@psg.com with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg