
Re: ISP failures and site multihoming [Re: Enforcing unreachability of site local addresses]



On Thu, 20 Feb 2003, Pekka Savola wrote:

> > Your comment may be true, but my clients are nonetheless unwilling to risk
> > the possibility of an extended network outage on a single ISP (while not
> > frequent, these events are far from unprecedented) rendering their online
> > customer-support environment unavailable for several hours, much less for
> > a day.  Shorter outages (on the order of minutes in the single digits) are
> > tolerated, provided that such outages are infrequent.

> This is a very problematic approach IMO.

> Need more resiliency?  Network outages unacceptable?

> The right place to fix this is the network service provider, period.
> Nothing else seems like a scalable approach.

There is no technical reason why a single service provider network can
do better than a comparable network made up of several smaller service
provider networks. Sure, BGP as-is doesn't provide the seamless
failover some people would like. It annoys me to no end that Cisco
uses a 180-second default hold time for BGP, twice the already too
conservative value suggested in the RFC. This means that when a
circuit goes down, BGP can take two or three minutes to notice. I
always recommend configuring a hold time of 15 seconds, but it seems
some vendors have designed their equipment in such a way that sessions
can fail with this value when the box is busy. IGPs have the same
fundamental problem, although the details differ: OSPF, for instance,
takes 40 seconds to detect a dead circuit.
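
To make the timer arithmetic concrete, here is a small sketch of my
own (Python, not router configuration, and not taken from any
implementation) that just spells out the worst-case detection delays
those values imply. The 90-second figure is inferred from the text
above (half of Cisco's 180-second default), and the OSPF figure is its
default dead interval of four missed 10-second hellos:

# Worst-case detection of a silently dead circuit by a keepalive-based
# protocol: nothing happens until the hold/dead timer expires, so the
# hold time is effectively the detection delay.

def worst_case_detection_s(hold_time_s: float) -> float:
    """Upper bound on how long a dead circuit goes unnoticed."""
    return hold_time_s

# Timer values discussed above; treat them as illustrative defaults.
scenarios = {
    "BGP, Cisco default hold time (180 s)": 180,
    "BGP, RFC-suggested hold time (90 s)":   90,
    "BGP, 15-second hold time":              15,
    "OSPF, default dead interval (40 s)":    40,
}

for name, hold in scenarios.items():
    print(f"{name}: up to {worst_case_detection_s(hold)} s before the"
          f" failure is even noticed, plus convergence time on top")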

> Consider a case when many companies _phone_ services would have been
> changed to VoIP.  IP would be a critical service.  Do the enterprises
> protect against failures by getting more ISP's?  Unscalable.  No, the
> ISP's _must_ get better.  Pick one well when choosing them.

We are _very_ far from a situation where even the best ISP provides a
service level better than the one you get from multihoming, even when
you factor in failover delays.

Also, these approaches aren't mutually exclusive. ISPs should get
better. Multihoming should get better.

At the same time, we should recognize that it is simply impossible to
have the same failover delays at layer 3 as at layer 1.

> When ISP's have SLA's, a lot of customers for which continued service is
> of utmost importance, the networks *will* work.  There is just no other
> choice.  If the mobile phone of CTO, CEO or whatever rings after (1)5
> minutes of network outage, things _will_ happen.

My experience with SLAs is that they are a marketing tool and job
security for bureaucrats. They don't make the worst case any better;
they only make it slightly less frequent.

(What makes you think this mobile phone will ring anyway? Speaking of
unreliable networks...)

And the single service provider thing doesn't scale anyway: the end
result would have to be a single global ISP.

> It just seems the mentality in some networks is that network outages are
> ok, networks don't have to be designed with multiple connections, etc.etc.

> That must change if we want to build a mission-critical IP infrastructure.
> Instead of making every site try to deal with the problems themselves.

Has the end-to-end principle failed to teach us anything? Reliability
begins and ends in the end hosts. If each host is connected over two
service providers, there are four possible paths the hosts can switch
between on a per-packet basis. Then the only remaining problem is
detecting failure. The end hosts are in an excellent position to do
this without having to generate keepalive messages; a well-designed
protocol could switch to an alternate path within a few round-trip
times when a path failure occurs.
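
To illustrate what such a host-based mechanism might look like, here
is a hypothetical sketch in Python. The class names, the documentation
addresses and the "a few round-trip times" threshold are all made up
for illustration; this is not an existing protocol, just the shape of
the idea: two addresses on each end give 2 x 2 = 4 candidate paths,
and an endpoint that tracks acknowledgements per path can fail over as
soon as the current path stops making progress.

import itertools
import time
from dataclasses import dataclass, field

@dataclass
class Path:
    src: str                # local address (one per upstream ISP)
    dst: str                # peer address (one per peer's upstream ISP)
    srtt: float = 0.2       # smoothed round-trip time estimate, seconds
    last_ack: float = field(default_factory=time.monotonic)

    def looks_dead(self, now: float, rtt_multiple: float = 3.0) -> bool:
        # Declare failure after a few round-trip times without progress
        # instead of waiting for a routing-protocol hold timer.
        return now - self.last_ack > rtt_multiple * self.srtt

class MultihomedEndpoint:
    def __init__(self, local_addrs, peer_addrs):
        # Two local and two peer addresses -> 2 x 2 = 4 candidate paths.
        self.paths = [Path(s, d)
                      for s, d in itertools.product(local_addrs, peer_addrs)]
        self.current = self.paths[0]

    def on_ack(self, path: Path, rtt: float) -> None:
        # Acknowledgements of ordinary data traffic double as liveness
        # information; no separate keepalives are needed.
        path.last_ack = time.monotonic()
        path.srtt = 0.875 * path.srtt + 0.125 * rtt  # TCP-style smoothing

    def pick_path(self) -> Path:
        now = time.monotonic()
        if not self.current.looks_dead(now):
            return self.current
        # Current path looks broken: fail over to a path that still
        # shows recent progress, preferring the lowest round-trip time.
        alive = [p for p in self.paths if not p.looks_dead(now)]
        self.current = min(alive or self.paths, key=lambda p: p.srtt)
        return self.current

# Example with documentation-prefix addresses: a host homed on two ISPs
# talking to a peer that is likewise dual-homed.
ep = MultihomedEndpoint(["2001:db8:a::1", "2001:db8:b::1"],
                        ["2001:db8:c::2", "2001:db8:d::2"])
print(len(ep.paths), "candidate paths")
print(ep.pick_path())

The point is that the failure signal comes from traffic the hosts are
exchanging anyway, so failover happens on the timescale of round trips
rather than routing-protocol hold timers.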

Multi6 has been gravitating towards multi-address multihoming
solutions for a while now, but unfortunately it seems impossible to
move forward.