
Re: Preserving established communications (was RE: about draft-nordmark-multi6-noid-00)



On 27 Oct 2003, at 14:26, Erik Nordmark wrote:

I don't do operations thus I'd be interested in folks with operational
experience commenting on the common and likely failures that a site should
worry about.
What I've heard of are failures due to links being cut between sites and their
ISP, backbone links back-hoed (but I don't understand what actual impact
they had on each ISP's network), and ISPs going bankrupt.

Don't forget power outages.


The result is always the same: one or more links go down. If these are attached to (real) routers on both ends, the failover happens pretty quickly as rerouting is triggered by the interface going down. If there is layer 2 gook in the middle, it takes longer as BGP sessions must time out. But even in these cases failover is usually fast enough that by the time the user investigates or retries, the problem is already gone. (10 seconds to 1 minute.)

The real fun starts when a large network has routing problems.

My take is that we should make the multihoming solution improve the
availability of sites with multiple Internet attachments without requiring or
assuming e2e periodic pings to quickly detect failures.

Agree, but note that there are ways to optimize these so they're not as evil as they could be if done using some kind of proxy.


Some applications/upper layer protocols might want to use such mechanisms
in addition to the rehoming support in the multihoming solution for quicker
failure detection. (For instance, SCTP already has such a mechanism: heartbeats.)
Thus I'm advocating not assuming that every ULP connection/session/association
has heartbeats when solving rehoming since we know of neither the performance
implications of this on a large scale, nor the benefits that applications
in general will derive from it.

Layer 4 already needs to receive acknowledgements from the other end in order to be sure it can continue to send. It's not much of a stretch to have a layer 4 protocol send a message back to the mh layer saying "you may want to trigger rehoming" when there are no ACKs for a while.
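
To make that concrete, here is a rough sketch of what such a hint could look like. Everything in it is an assumption for illustration: the AckWatchdog class, the on_suspect callback into the mh layer, and the timeout value are hypothetical and do not come from any draft or existing stack. The point is only that the transport already knows when it has unacknowledged data outstanding, so it can raise a hint without sending any extra probes.

```python
import time
from typing import Callable, Optional


class AckWatchdog:
    """Hypothetical layer 4 helper that hints the mh layer when ACKs stop arriving."""

    def __init__(self, ack_timeout: float, on_suspect: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ack_timeout = ack_timeout    # seconds without ACKs before hinting (assumed value)
        self.on_suspect = on_suspect      # hypothetical callback into the mh layer
        self.clock = clock                # injectable so the logic can be tested
        self.waiting_since: Optional[float] = None
        self.hinted = False

    def on_send(self) -> None:
        # Start timing only when data becomes outstanding, so an idle
        # connection never generates a hint (and no extra traffic is sent).
        if self.waiting_since is None:
            self.waiting_since = self.clock()

    def on_ack(self, all_acked: bool) -> None:
        # Any ACK shows the current path still works: clear the hint state
        # and restart the timer if data is still outstanding.
        self.hinted = False
        self.waiting_since = None if all_acked else self.clock()

    def poll(self) -> bool:
        # Called from the transport's existing timer; returns True when a hint is raised.
        if self.waiting_since is None or self.hinted:
            return False
        if self.clock() - self.waiting_since >= self.ack_timeout:
            self.hinted = True            # hint once per outage, not on every tick
            self.on_suspect()
            return True
        return False
```

The hint is advisory only: the mh layer remains free to ignore it or to verify the path itself before rehoming.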