Re: Preserving established communications (was RE: about draft-nordmark-multi6-noid-00)
On 27 Oct 2003, at 14:26, Erik Nordmark wrote:
I don't do operations, so I'd be interested in folks with operational
experience commenting on the common and likely failures that a site
should worry about.
What I've heard of are failures due to links being cut between sites
and their ISP, backbone links back-hoed (though I don't understand what
actual impact these had on each ISP's network), and ISPs going bankrupt.
Don't forget power outages.
The result is always the same: one or more links go down. If these are
attached to (real) routers on both ends, the failover happens pretty
quickly as rerouting is triggered by the interface going down. If
there is layer 2 equipment in the middle, it takes longer as BGP sessions
must time out. But even in these cases failover is usually fast enough
so that when the user experiences a problem, the problem is already
gone when they investigate or retry. (10 seconds - 1 minute.)
The real fun starts when a large network has routing problems.
My take is that we should make the multihoming solution improve the
availability of sites with multiple Internet attachments without
requiring or assuming e2e periodic pings to quickly detect failures.
Agreed, but note that there are ways to optimize these, for instance by
using some kind of proxy, so they're not as evil as they could be.
Some applications/upper layer protocols might want to use such
mechanisms in addition to the rehoming support in the multihoming
solution for quicker failure detection. (For instance, SCTP already has
such a mechanism: heartbeats.) Thus I'm advocating not assuming that
every ULP connection/session/association has heartbeats when solving
rehoming, since we know neither the performance implications of this on
a large scale nor the benefits that applications in general will derive
from it.
Layer 4 already needs to receive acknowledgements from the other end in
order to be sure it can continue to send. It's not much of a stretch to
have a layer 4 protocol send a message back to the mh layer saying "you
may want to trigger rehoming" when there are no ACKs for a while.
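
To make that concrete, here is a rough sketch of what such a hint could
look like. Everything in it is an assumption for illustration: the
AckWatchdog class, the rehome_hint callback into the mh layer, and the
timeout value are hypothetical names and numbers, not part of any
existing stack or of the draft under discussion.

```python
import time
from typing import Callable, Optional


class AckWatchdog:
    """Hypothetical layer 4 helper: hints the multihoming (mh) layer
    when data is outstanding and no ACK has arrived for a while."""

    def __init__(self, timeout: float, rehome_hint: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout          # seconds of ACK silence before hinting (assumed)
        self.rehome_hint = rehome_hint  # callback into the mh layer (hypothetical)
        self.clock = clock              # injectable so the logic can be tested
        self.waiting_since: Optional[float] = None  # None: nothing unacknowledged
        self.hinted = False

    def on_send(self) -> None:
        # Start the silence timer only when the first unacknowledged data goes out.
        if self.waiting_since is None:
            self.waiting_since = self.clock()

    def on_ack(self, all_acked: bool) -> None:
        # Any ACK shows the current path works: stop or restart the timer.
        self.waiting_since = None if all_acked else self.clock()
        self.hinted = False

    def poll(self) -> bool:
        # Meant to run from the transport's existing timer tick; sends no packets.
        if self.waiting_since is None or self.hinted:
            return False
        if self.clock() - self.waiting_since >= self.timeout:
            self.hinted = True
            self.rehome_hint()  # "you may want to trigger rehoming"
            return True
        return False
```

Note that an idle connection never triggers the hint, since the timer
only runs while data is unacknowledged, and that the hint fires at most
once per period of silence. The mh layer remains free to ignore it.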