
Re: [RRG] Perplexing PMTUD and packet length observations



On 12 Aug 2008, at 15:55, Robin Whittle wrote:

If not, I'd say that all of this is a bug in the Linux networking
code (which is weird to begin with)

I can't imagine it is a bug.

Well, the Linux people do sometimes do things that other people call bugs... IMO, limiting the amount of data in flight, but not the number of packets as dictated by the correspondent's MSS, is not an appropriate way to do TCP congestion control.

It is conceivable it is http doing this,

Not really. HTTP runs over TCP, so this has to be TCP's doing.

However, Linux has a "sendfile" syscall that does what you would imagine it does, so maybe it's that code path that does this. You may want to try both with a static file and with a script that generates sufficiently large output, to see whether that's the case.
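
For what it's worth, a minimal sketch of what a sendfile-based send looks like, assuming a plain Linux TCP socket and a regular file (the function name and error handling are illustrative, not the actual server code):

    /* Sketch: serving a file with Linux sendfile(2). */
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send the whole file at 'path' down the connected socket 'sock'. */
    static int send_whole_file(int sock, const char *path)
    {
        int in_fd = open(path, O_RDONLY);
        if (in_fd < 0)
            return -1;

        struct stat st;
        if (fstat(in_fd, &st) < 0) {
            close(in_fd);
            return -1;
        }

        off_t offset = 0;
        while (offset < st.st_size) {
            /* The kernel copies from the page cache straight to the
             * socket; user space never sees the data, and the kernel
             * decides how the stream is segmented. */
            ssize_t n = sendfile(sock, in_fd, &offset, st.st_size - offset);
            if (n <= 0) {
                close(in_fd);
                return -1;
            }
        }
        close(in_fd);
        return 0;
    }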

I have no explanation for either of these things - the server
bundling together TCP data in flagrant violation of the RFCs as I
understand them, and (as best I can guess) the PPPoE router taking
it upon itself to recreate the individual RFC conformant TCP packets.

Hm, I'd still like to see, from another system, whether the Linux box actually spits out jumboframes or not. What do you have your MTU set to?
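
(If it helps, the configured MTU can also be read programmatically; a sketch using the standard SIOCGIFMTU ioctl, with the interface name as a placeholder:)

    /* Sketch: reading an interface's configured MTU on Linux,
     * like "ip link show eth0" does. */
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int get_if_mtu(const char *ifname)   /* e.g. "eth0" */
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        if (sock < 0)
            return -1;

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

        int mtu = (ioctl(sock, SIOCGIFMTU, &ifr) < 0) ? -1 : ifr.ifr_mtu;
        close(sock);
        return mtu;
    }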

It's not in the subject lines, and there is no search facility.

Hm, it wasn't so recent apparently, and ack on the quality of the archive:

http://readlist.com/lists/trapdoor.merit.edu/nanog/7/35484.html

MSS is end-to-end, you still need PMTUD or fragmentation.

Yes - and with Google sending out large packets with DF=0, it is
expecting any hapless router in the middle, with a lower next hop
MTU than this length, to do a lot of work without complaint.

Such are the perils of implementing RFC 791.

Fragmentation CAN be done in hardware; I'm not sure how much gear actually does. But you can always get around this by configuring all your stuff with the same MTU. (Your ISPs may not want to go along with that, though.)
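
For completeness: on Linux, whether a socket's packets carry DF at all is a per-socket setting; a sketch (the setsockopt call is the standard Linux API, the helper name is mine):

    /* Sketch: choosing between RFC 1191 behaviour (DF set, rely on
     * ICMP too-bigs) and RFC 791 behaviour (DF clear, let routers
     * fragment) on a Linux socket. */
    #include <netinet/in.h>
    #include <sys/socket.h>

    static int set_df(int sock, int df)
    {
        int val = df ? IP_PMTUDISC_DO      /* DF=1 on every packet    */
                     : IP_PMTUDISC_DONT;   /* DF=0, allow fragmenting */
        return setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER,
                          &val, sizeof(val));
    }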

Indeed. Still, it probably has to be done at some point,
especially if we ever want to move away from 1500 as the
internet's maximum packet size.

We should all stop burning fossil fuel at some point too.

The difference is that burning oil is cheaper than the alternative, while breaking connectivity and sending packets smaller than what the hardware is capable of isn't.

It would be nice if the sending host had a better clue about the
outside world than the simple fact that its Ethernet link has an MTU
of 9k or so.

There are two ways to accomplish this:

1. ask someone who knows
2. measure

MTU information isn't known, so there's no one to ask (it could be added to routing protocols, but apparently few people cared enough to do that); the only alternative is to send packets and observe what comes back.
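
In the spirit of "measure", a rough sketch of RFC 4821-style probing: binary-search for the largest DF-marked packet that gets through. The probe callback is an assumption here; it would send a packet of the given size and report whether the peer acknowledged it:

    /* Sketch: binary-search path MTU probing (in the spirit of
     * RFC 4821). probe(size) is an assumed callback: send a DF-marked
     * packet of 'size' bytes, return 1 if acknowledged, 0 if lost. */
    typedef int (*probe_fn)(int size);

    static int discover_pmtu(probe_fn probe, int lo, int hi)
    {
        /* Precondition: a packet of size 'lo' is known to get through. */
        while (lo < hi) {
            int mid = lo + (hi - lo + 1) / 2;  /* round up, so mid > lo */
            if (probe(mid))
                lo = mid;        /* mid fits: the path MTU is >= mid */
            else
                hi = mid - 1;    /* mid is too big: path MTU < mid   */
        }
        return lo;   /* largest size observed to get through */
    }

    /* Usage sketch: discover_pmtu(udp_probe, 1280, 9000); */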

Set it only for 10% of your packets and you still have
connectivity when there is a black hole and the PMTUD works just
fine.

OK - so routers would fragment 90% of the packets and the PTB only
goes back when one of the 10% of packets has its DF flag set?

That just seems to slow down the sending host's response to the MTU
situation.

Yes, that's what the RFC 1191 people keep saying.

Apparently there's a disconnect between spec writers and implementers on the one hand and people who have to debug connectivity problems on the other, with the clueless firewall admins living in a bubble, disconnected from everything.

I still like RFC 1191 better.  There's no fragmentation and the
sending host gets the fastest possible feedback that it needs to
send smaller packets.

I'll take reliable over fast.

Removing fragmentation from the network is a really good aspect
of IPv6, I think.  Ideally, I think, all packets should be sent
DF=1 and all applications should be ready to cope

No. This is a layer 3 job, not a layer 7 job.

I don't understand this.

Application writers can't be trusted with the internals of the network.

Unfortunately, the surviving packet fragment isn't much use
to the destination host, so it still takes 1.5 RTTs to get the data
there.  Still, that is better than 3.5 RTTs with RFC 1191.

You can still use the data, except that you can't check its integrity because the checksum is now incorrect. So the semi-ACK asks the other side to send just the checksum over the data that was correctly received. Did I forget to explain this part?
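
To make the checksum part concrete: both sides would run the standard 16-bit ones'-complement Internet checksum (RFC 1071) over the same prefix, i.e. the bytes that arrived intact, and compare. A sketch of just that computation (the framing of the semi-ACK itself is the proposal's and isn't shown):

    /* Sketch: RFC 1071 Internet checksum over the 'len' payload bytes
     * that were received intact. */
    #include <stddef.h>
    #include <stdint.h>

    static uint16_t inet_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;

        while (len > 1) {                  /* sum 16-bit big-endian words */
            sum += ((uint32_t)data[0] << 8) | data[1];
            data += 2;
            len  -= 2;
        }
        if (len == 1)                      /* odd final byte, zero-padded */
            sum += (uint32_t)data[0] << 8;

        while (sum >> 16)                  /* fold the carries back in */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;
    }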

My understanding of this is that if all hosts have a next hop MTU of
1500, and the core has an MTU of 9000, then it is no problem if the
destination network blocks PTBs from leaving that network since no
host would be sending packets bigger than 1500 anyway.

Your premise is invalid so the conclusion is meaningless.

Actually, it would be interesting to do some research into the MTU distribution across the internet.

But plenty of servers - probably most by now - have gigabit ethernet
and so have a real PMTU for most of the core, and into quite a few
edge networks, of 9k or so.

Note that although the 9000-byte jumboframe capability is common, there are also many implementations that use different sizes, so it's impossible to standardize on anything, even if you could ignore the fact that the current internet expects 1500.

Also, because of the 802.3 spec, the jumboframe capability must be enabled administratively, and because of the IP-over-ethernet specs, all hosts on a subnet must use the same MTU, so basically deployment is impossible. (This is what my draft addresses.)
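
(On Linux, "enabled administratively" boils down to raising the interface MTU, i.e. "ip link set dev eth0 mtu 9000" or the equivalent ioctl; a sketch, interface name again a placeholder:)

    /* Sketch: administratively setting an interface MTU on Linux.
     * Needs CAP_NET_ADMIN; the NIC and driver must support the size. */
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int set_if_mtu(const char *ifname, int mtu)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        if (sock < 0)
            return -1;

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_mtu = mtu;

        int ret = ioctl(sock, SIOCSIFMTU, &ifr);
        close(sock);
        return ret;
    }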

When they send a packet to some edge network with 1500 MTU links,
which blocks the PTBs which should go back to the sending host, then
there is a black hole.

You mean: a network that doesn't generate them in the outgoing direction?

In practice this won't be a problem, because few people will connect to the internet with a 1500+ MTU and then not generate too bigs. Routers generate them out of the box, ISPs usually don't have firewalls in the middle of their networks, and ISPs don't like support calls, so they tend to generate them.

If you use an MTU bigger than the standard 1500 and also filter ICMP, you shoot yourself in the foot, so you're not likely to do both. The trouble is mainly with using a smaller MTU: then the problem is caused by _other_ people not listening to _your_ too bigs, and there is little that you can do.

I guess the majority of websites now can send jumboframes, like my
server can.

That doesn't mean that all the stuff in the middle can handle jumboframes. The core of the network generally can, but the stuff around the edges, like the cheap switches that connect dozens of servers like yours, is likely to support only small packet sizes, either 1500 or "mini jumbos" of 1500 to 2000 bytes.

I am not sure where the MSS is configured.

It isn't. The MSS option is derived from the MTU of the interface through which the destination address is reachable.
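
Concretely (assuming IPv4 and no IP or TCP options): the advertised MSS is the interface MTU minus 40 bytes of headers, so 1460 for a 1500-byte MTU and 8960 for a 9000-byte jumbo MTU. As a one-liner:

    /* Sketch: where the advertised IPv4 MSS comes from
     * (no IP or TCP options assumed). */
    static int mss_from_mtu(int mtu)
    {
        return mtu - 20 /* IPv4 header */ - 20 /* TCP header */;
    }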

Yeah, none for "RFC 791 deployment" either...

Touché - but where is the evidence of applications and operating
systems actually implementing RFC 4821?

I haven't seen any...

Is there any site, any
working group or whatever where this is discussed?


Not sure if this is officially dead or not:

http://staff.psc.edu/mathis/MTU/

"Just remember: The glass is neither half full nor half empty, it is merely the wrong size."

The optimist says the glass is half full. The pessimist says it's half empty. The engineer says the glass is twice as big as it needs to be.



