Re: survivability, rewriting



Pekka Savola wrote:
> 
> On Fri, 31 Oct 2003, Brian E Carpenter wrote:
> > I agree with this, and I'd add that many applications can survive much
> > longer glitches than 5 seconds, and even TCP resets, by putting some fairly
> > trivial retry logic in the right place. [...]
> 
> (I'm pretty sure you agree here, but playing the devil's advocate to bring
> up an important point here..)
> 
> Is it the business of the applications to put in this retry logic?
> 
> No.
> 
> If *every* application has to do this, we've failed.  If adding such
> logic is deemed the best approach, it needs to be put somewhere else.

That is what I would have said a few years ago. But the fact of life is
that TCP resets do occur, and if you are building a business-class
application you will *not* allow them to cause an application-level
failure. So all the business-class applications that I know of already
have retry logic, and it was put there by programmers who wouldn't know
a multihoming event if it hit them in the face.
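
To make "fairly trivial retry logic" concrete, here is a minimal sketch
in Python of the kind of wrapper I have in mind. The name with_retry,
the attempt count, and the backoff delays are all illustrative
assumptions, not taken from any particular application; the only point
is that a reset or timeout is caught and the whole operation is tried
again on a fresh connection.

```python
import time


def with_retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation() and return its result.

    If it fails with a connection-level error (reset, refused, broken
    pipe, timeout), wait and call it again, doubling the wait each time.
    The last failure is re-raised once all attempts are used up.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

The operation passed in would open its own connection each time it is
called, so that a retry after a reset starts from a clean socket rather
than reusing the dead one. This is only safe if the operation can be
repeated without harm, which is the application's business to know.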

Actually it's just an extension of the fate-sharing argument. If the host
hasn't actually crashed and burned, it should try again at successively
higher levels of the stack until things work again.

That's why I've always rated transport survivability as only "nice to have"
in multihoming.

   Brian