Re: failure detection
On Fri, 19 Aug 2005, Iljitsch van Beijnum wrote:
RAs have very long lifetimes. I think the Cisco default is a week.
You can't bring this down to a minute or less without all kinds of
interesting side effects.
You have 2 lifetimes, preferred and valid. Valid can't go under 2
hours. Preferred can be as short as you like, down to 3s, and
preferred is the one we're interested in.
I have preferred set to 10s here.
What are the side effects?
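
For reference, a rough sketch of how a host treats the valid lifetime from a received RA, assuming the "two-hour rule" from IPv6 stateless autoconfiguration (RFC 2462, section 5.5.3). The function name and arguments are made up for illustration. The preferred lifetime has no such floor, which is why it is the useful knob here.

```python
TWO_HOURS = 2 * 60 * 60  # seconds


def updated_valid_lifetime(advertised, remaining, authenticated=False):
    """Valid lifetime a host keeps for an address after processing an RA.

    advertised: Valid Lifetime carried in the prefix option (seconds)
    remaining:  valid lifetime the address currently has left (seconds)
    """
    if advertised > TWO_HOURS or advertised > remaining:
        # Long or lengthening lifetimes are accepted as advertised.
        return advertised
    if remaining <= TWO_HOURS:
        # Already under two hours: an unauthenticated RA cannot shorten it.
        return advertised if authenticated else remaining
    # A short advertised value can only pull the lifetime down to two hours.
    return TWO_HOURS


# A 10s valid lifetime against a week-old default only gets you to 2 hours.
print(updated_valid_lifetime(10, 7 * 24 * 3600))  # 7200
```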
I agree that when certain information is available, it makes sense
to distribute it locally rather than have every host go out and
discover the same facts for itself. We'll have to come back to this
at some point.
Progress :).
- fix the problems in internet routing
We await your suggestions...
Ha!
However, in a working group not so far away, there are people looking
at these things.
The trouble is that you need aggregation to make routing scale, and
with aggregation you lose all this interesting info that would have
been useful. Routing can still tell you some interesting things
when there are wide-spread catastrophes, but I'm not sure it's
worth the trouble to optimize for that. (Or rather: I'm pretty sure
it isn't.)
That's great, but we're discussing a host protocol (and maybe even a
leaf-site border protocol). Hosts should behave like hosts and not
try to probe every possible path by default. Imagine:
          Host2(shimmed)
            |      \
            |       \
           ISP3     ISP4
           |  \       |
           |   \_tier-1
        tier-2   |    \
            \ \__|_    |
             \   | \   |
              ISP1  ISP2
                \   /
                 \ /
            Host1 (shimmed)
Host1 is communicating with Host2 using Host2's tier-2 ISP locator.
Tier-2 has a failure affecting the shim 'flow' Host1 is using. The
tier-1's POP gets 2*X probe packets - for no good reason.
Further, within a minute, maybe less, of tier-2 failing, both ISP1
and ISP2 have switched over their routes to ISP3 to go via tier-1.
Now imagine ISP1 and ISP2 have thousands upon thousands of shimmed
customers. Now imagine these probes at an internet wide scale. Is it
worth introducing all that n^2 probing noise into the internet when
core internet routing likely will fix the problem anyway, maybe
within a minute, maybe faster?
Probe for availability of the /remote/ locators sure. But don't
combine it with every possible combination of local address please -
it's not needed.
wow 2*4^2, ie 32 packets to complete probing (worst case). Imagine 50 such
shim6 hosts on your network.
Well you really want to send at least 3 probes to account for random packet
loss. :-)
Ouch.
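
To put numbers on that, a back-of-envelope sketch. It assumes every local/remote locator pair is probed, each probe gets a reply (the factor of 2), and each probe is repeated `retries` times. The function name and parameters are just for illustration.

```python
def worst_case_packets(local, remote, retries=1):
    """Packets on the wire for a full-mesh probe of all locator pairs."""
    pairs = local * remote          # every local address against every remote one
    return 2 * pairs * retries      # probe plus reply, repeated for packet loss


print(worst_case_packets(4, 4))                   # 32
print(worst_case_packets(4, 4, retries=3))        # 96
print(50 * worst_case_packets(4, 4, retries=3))   # 4800, for 50 such hosts
```

Probing only the remote locators from the current source address would be `worst_case_packets(1, 4)`, i.e. 8 packets instead of 32.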
But it's easy. Shim6 is *not* TCP, it doesn't need to maintain any
/specific/ consistency of addresses. Eg, in this example, why on
earth is B replying with (B1,A1)? The reply from B (in my mind)
would be (B3,A1).
The reason why it's not easy is that at this point, the shim hasn't been
activated yet, we're just doing regular TCP.
I don't quite understand this. This is intended as an optimisation
for the case where the ULIDs are 'routeable' directly between the 2
hosts, right?
This is necessary to maintain backward compatibility.
I'm not sure I fully understand this point.
I suspect this (having a ULID be a locator) can be done too. So you
may only have to shim if addresses change. But once shimmed, the
/shim/ mapping can be of *sets* of addresses to the other set of
addresses, not of specific tuples. Ie, the mapping should be:
( {A1,A2,A3,A4} , {B1,B2,B3,B4} ) -> .....
Then you can drop/add things out of the sets as required.
If the mapping maps A1 to A1, fair enough. (This is the "null
transform" case in Geoff's architecture, right?)
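
A minimal sketch of what I mean by mapping sets rather than tuples. This is purely illustrative: `ShimContext` and its methods are hypothetical names, not anything from a shim6 draft, and it ignores how the two sides would securely agree on the sets.

```python
from dataclasses import dataclass, field


@dataclass
class ShimContext:
    """State kept by host A for one peer B: two locator sets, no fixed pairs."""
    local: set = field(default_factory=set)    # e.g. {A1, A2, A3, A4}
    remote: set = field(default_factory=set)   # e.g. {B1, B2, B3, B4}

    def accepts(self, src, dst):
        # Any (remote, local) combination belongs to the same context,
        # so a reply from B3 to A1 is fine even if we sent to B1.
        return src in self.remote and dst in self.local

    def drop_remote(self, locator):
        self.remote.discard(locator)

    def add_remote(self, locator):
        self.remote.add(locator)


ctx = ShimContext({"A1", "A2", "A3", "A4"}, {"B1", "B2", "B3", "B4"})
print(ctx.accepts("B3", "A1"))   # True
ctx.drop_remote("B3")
print(ctx.accepts("B3", "A1"))   # False
```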
And even if you activate the shim at this point, the two sides
haven't been able to compare notes yet, so you can't start doing
strange tricks yet, or at least you run into security
complications.
Ah, I'm not familiar with these, would you be able to explain or
refer me to something?
Ah, good that you said so because we all thought you were
supporting this.
ROFL :). Just thought I'd make it clear ;).
Don't forget that if the site exit router does its own version of
the ingress filtering, it can send back ICMP messages so the host
knows that this source address doesn't work and move on without
much delay.
Yep.
So after a maximum of 3 messages with incorrect source addresses A
knows it should use A4, and then it only has to do B1, B2, B3 and
B4 to find the working A4-B4 pair.
Yep.
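
The counting above can be sketched as a toy search, under the assumption that the site exit router answers a probe from a filtered source with an ICMP error, so the host abandons that source after one packet. The function and the two callbacks are hypothetical stand-ins for real probing.

```python
def find_working_pair(local, remote, source_filtered, pair_works):
    """Return ((src, dst), probes_sent) for the first working pair, or (None, n)."""
    sent = 0
    for src in local:
        for dst in remote:
            sent += 1
            if source_filtered(src):
                break  # ICMP error came back: stop using this source address
            if pair_works(src, dst):
                return (src, dst), sent
    return None, sent


local = ["A1", "A2", "A3", "A4"]
remote = ["B1", "B2", "B3", "B4"]
pair, sent = find_working_pair(
    local, remote,
    source_filtered=lambda src: src != "A4",            # only A4 passes the filter
    pair_works=lambda src, dst: (src, dst) == ("A4", "B4"),
)
print(pair, sent)  # ('A4', 'B4') 7
```

That is 3 wasted probes plus 4 for the B locators, against 16 for the full mesh.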
Also, if the host has several sessions towards different
destinations, it may observe that 2001:a900:456::1 isn't
working, so if it has to choose between trying 2001:a900:789::1 and
3ffe:ffff:789::1 it will choose the latter because there is a
chance the whole 2001:a900::/32 block is affected.
That seems a possible strategy, yes.
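
One way that strategy might look, as a sketch only: try candidates outside the /32 of any known-failed address first. The /32 grouping and the function name are assumptions for illustration; a real host would not necessarily know the aggregate boundary.

```python
import ipaddress


def order_candidates(candidates, failed, prefix_len=32):
    """Sort candidate locators so those sharing a prefix with a failure go last."""
    suspect = [
        ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        for addr in failed
    ]

    def is_suspect(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in suspect)

    # sorted() is stable and False sorts before True, so the original
    # order is kept within the "clean" and "suspect" groups.
    return sorted(candidates, key=is_suspect)


print(order_candidates(
    ["2001:a900:789::1", "3ffe:ffff:789::1"],
    failed=["2001:a900:456::1"],
))
# ['3ffe:ffff:789::1', '2001:a900:789::1']
```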
So in reality having to test 2*n^2 will be extremely unlikely.
I would agree, but more because imho the internet is reasonably
reliable, so usually only a few will fail.
2*n^2 is for both sides btw - sent and received. n^2 (or local*remote
locators, tending to n^2 for worst case) is probes sent. And the 2 is
insignificant anyway compared to the square ;).
The trouble is that many small ISPs around here connect to the rest
of the world through one location in Amsterdam, so when there is a
power failure at that location, not only their AMS-IX stuff goes
down but also their transit. Last time there was an AMS-IX power
failure (a month before the generator they were installing because
of the one-but-last power failure went online) about 25% of all
AMS-IX members were completely unreachable for me.
Yes, that seems to be a problem in BeNeLux - huge overdependence on
AMS-IX. When you're that close to AMS-IX and everyone is there,
there's not much 'push' to set up physically separate transit and
peerings. That though is an internet architecture issue, particularly
for NL and BeNeLux.
The internet ecosystem will evolve further to cope with these things
(outside of shim), eg my point about VoIP being a driver in the
routing area (see mail to marcelo). Routing will get better and
better.
BTW, my ISP just had a very big DoS attack. The shim would have
enabled me to keep working to the extent possible, routing can't
really do anything in these cases.
Yep, I really want shim too. ;) I want it to allow for sanity though.
;) (while still allowing for implementations to try full probing if
they really must).
Do you want to add to the DoS of your ISP by having lots of shimmed
hosts then go and DoS the *other* ISP? :)
Note, that for something like a DoS, where you're getting just poor
service rather than no service, it might be easier to just update
your routes or your local IPv6 RA prefixes to not use that ISP.
That's an argument again for making shim6 interact with normal OS
routing+SAS layer like any other application.
For extra brownie points, write a small daemon to monitor RTT and loss
rates to each of your ISPs and adjust local routes/prefixes/whatever
to suit your tastes.
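
The decision part of such a daemon could be as simple as the sketch below. It assumes you already collect per-ISP probe samples somehow (RTT in ms, `None` for a lost probe) and leaves out both the measuring and the actual route/RA changes. All names are hypothetical.

```python
from statistics import mean


def score(samples):
    """Return (loss_rate, mean_rtt) for one ISP; lower is better on both."""
    if not samples:
        return (1.0, float("inf"))   # no data: treat as unusable
    answered = [s for s in samples if s is not None]
    loss = 1 - len(answered) / len(samples)
    rtt = mean(answered) if answered else float("inf")
    return (loss, rtt)


def rank_isps(measurements):
    """Order ISP names best-first: by loss rate, then by mean RTT."""
    return sorted(measurements, key=lambda isp: score(measurements[isp]))


measurements = {
    "isp1": [20, None, None, 25],   # fast but dropping half the probes
    "isp2": [40, 42, 41, 39],       # slower, no loss
}
print(rank_isps(measurements))  # ['isp2', 'isp1']
```

Ranking on loss before RTT is just one taste; weight it however suits you.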
regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune:
Pick another fortune cookie.