Re: failure detection
On 19-aug-2005, at 17:11, Paul Jakma wrote:
A failure which can easily be detected locally without any need for
n^2 probing.
The 2*n^2 probing isn't necessary to detect failures (local or
otherwise), but to detect what's still working. And even if side A
can easily detect a failure at site A (which isn't a given: if my DSL
line goes down, my router knows it but my hosts don't), how does side
B learn this fact?
What's needed is:
- Host1 to detect the local failure and update the exit path to use
(and hence the source to use)
A host doesn't necessarily know which exit path a router will choose.
- this is achievable in multiple ways
- none of which need be in shim6
The shim needs to do the address rewriting to make it work, so I
guess it can be argued that the shim doesn't have to be the one to
determine this information, but it does need to know it.
- none of which require shim6 to be aware of SAS or egress
issues
So what happens when, through means outside our view, a packet gets a
destination address routed over ISP X, but a source address from
ISP Y's address space, and X filters Y's addresses?
- Host2 shim6 to detect host1's valid locators have changed
- Maybe because it receives a packet from Host1 with a new
source
This doesn't allow for unidirectional reachability.
You want to specify that shim6 be able to work around /any/ kind of
routing failure, anywhere on any part of the internet affecting any
path between Host1 and Host2.
Yes, I do. As a BGP jockey, I'm kind of like the health inspector who
never eats out... There is a lot going on that regular users don't
really know about.
My gut feelings though are:
- Failures typically are near the edges
Maybe those are a bit more common, but it's not like failures in the
core never happen.
- Failures are typically bi-directional for a given path
Ingress filtering has the potential to create lots of unidirectional
reachability for a given address combination.
- Uni-directional failures tend to be due to /congestion/, not
Nonsense. Congestion is rare these days, and the levels necessary to
break connectivity wholesale are almost unheard of.
- Failures in the 'middle' are uncommon, and tend to affect /huge/
numbers of paths (ie there's a decent chance it will take out /all/
your paths)
So?
Hence (as a gut feeling):
- n^2 probing in shim6 is simply introducing huge expense in order to
solve a very uncommon problem
Yeah I don't get this point you're arguing so energetically. Let's
build a test network. (I'll be talking about hosts, but obviously
many aspects are site-wide.)
Host A has two interfaces, each of which eventually connects to a
router that connects to two ISPs. So:
Addr A1: int 1 - ISP K
Addr A2: int 1 - ISP L
Addr A3: int 2 - ISP M
Addr A4: int 2 - ISP N
Same thing for its correspondent host B:
Addr B1: int 1 - ISP O
Addr B2: int 1 - ISP P
Addr B3: int 2 - ISP Q
Addr B4: int 2 - ISP R
Let's assume that each router will do source address based routing
for the two ISPs it connects to, but the ISPs all do ingress filtering.
A initiates a TCP session with destination address B1. Let's assume
that the system chooses interface 1 for output and A1 as a source
address, so the packets have address pair A1-B1.
Now it's entirely possible that B's default route is over ISP Q. So
when B sends a reply to the A1-B1 session setup request, it sends a
B1-A1 packet out on interface 2. Now either the site exit router will
filter it, or it will end up at ISP Q or ISP R, which will filter it.
This is the infamous ingress filtering problem that we have to figure
out.
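
To make the filtering case concrete, here is a minimal sketch in Python. It only models the assumptions stated above (each address comes from one ISP, each interface leads to a router serving two ISPs, and a packet is dropped when its source address doesn't belong to one of those ISPs). The table and function names are made up for illustration.

```python
# Hypothetical model of host B from the example above.
# Each address comes out of one ISP's address space.
ADDRESS_ISP = {"B1": "O", "B2": "P", "B3": "Q", "B4": "R"}
# Each interface leads to a router that connects to two ISPs.
INTERFACE_ISPS = {1: {"O", "P"}, 2: {"Q", "R"}}


def survives_ingress_filtering(source, interface):
    """Return True if a packet with this source address can leave through
    this interface without being dropped by the exit router or its ISPs."""
    return ADDRESS_ISP[source] in INTERFACE_ISPS[interface]


# The B1-A1 reply sent out on interface 2 (default route over ISP Q) is dropped.
print(survives_ingress_filtering("B1", 2))  # False
# The same reply sent out on interface 1 gets through.
print(survives_ingress_filtering("B1", 1))  # True
```

So the choice of source address and the choice of exit have to agree, whoever ends up making them.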
But let's assume we somehow fix this problem, and packets flow
without trouble between A1 and B1. The TCP session continues for a
bit, and at some point the shim wakes up and decides that this is a
long-term session that should be protected from failures. So the shim
layer on host A sends out a packet with source A1 and destination B1
(= addresses from the TCP session) which includes security stuff and
the list of local alternative locators: A2, A3 and A4. B also happens
to implement the shim, so it answers with some security stuff of its
own and its list of alternative locators: B2, B3 and B4.
So now we're ready for the internet to fail.
Scenario 1: A's link to ISP K fails.
Since this is something A's router can detect, presumably any packets
from A1 to B1 will get back an ICMP message, and after a few RTTs TCP
becomes really unhappy. The shim may also observe that there are
packets going from A1 to B1, but there is nothing coming in from B1
to A1. Maybe the shim decides to fire off a probe from A1 to B1 for
good measure. But eventually, it's clear that A1 to B1 doesn't work
anymore.
Now suppose that the reachability detection subsystem at A decides to
see if B2 works. If A sticks to source address A1, then the packet
will also incur an ICMP and not make it. So either A sees the ICMP
and selects a different source address, or it decides that A1-B2
doesn't seem to work either and goes on to the next address pair. For
instance A could try A2-B1. And this one works!
So from now on any outgoing packets with addresses A1-B1 in them are
rewritten into A2-B1 and sent on their way.
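
The rewriting step can be sketched roughly as follows. This is only an illustration of the idea, not how any shim implementation does it. The class and method names are hypothetical, and only the outgoing direction described above is shown.

```python
class ShimState:
    """Hypothetical per-session shim state on host A."""

    def __init__(self, session_pair):
        # Pair the transport protocol uses, e.g. ("A1", "B1"); never changes.
        self.session_pair = session_pair
        # Pair currently put on the wire; starts out identical.
        self.wire_pair = session_pair

    def switch_to(self, new_pair):
        """Record a pair that reachability detection found to work."""
        self.wire_pair = new_pair

    def rewrite_outgoing(self, src, dst):
        """Rewrite packets of the protected session; leave others alone."""
        if (src, dst) == self.session_pair:
            return self.wire_pair
        return (src, dst)


shim = ShimState(("A1", "B1"))
shim.switch_to(("A2", "B1"))
print(shim.rewrite_outgoing("A1", "B1"))  # ('A2', 'B1')
print(shim.rewrite_outgoing("A3", "B2"))  # ('A3', 'B2'), not our session
```

The transport keeps seeing A1-B1 the whole time, which is why the shim needs to know the working pair even if something else determines it.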
Any complaints so far?
Scenario 2: big failure, and everything is wiped out except A4-B4.
(From where I sit 99% of all traffic flows through Amsterdam, and
most of that 99% over the AMS-IX. A nice big power failure there
really hurts my connectivity.)
So A tries:
A1-B2
A2-B1
A1-B3
A3-B1
A1-B4
A4-B1
and on and on and on, until it eventually determines that A4-B4 works.
You don't want this to happen. So what's the alternative? Give up
after the second try? The fourth? The n^2/2th?
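
To put a number on it, here is a rough sketch of one possible probing order. The ordering rule (try pairs that change only one address first, then pairs that change both) is my assumption; it happens to reproduce the first six probes listed above but is not meant as the definitive algorithm. All names are hypothetical.

```python
from itertools import product

LOCAL = ["A1", "A2", "A3", "A4"]
REMOTE = ["B1", "B2", "B3", "B4"]


def probe_order(local, remote, current):
    """Yield candidate address pairs, skipping the failed current pair.
    Pairs that change only one address are tried first (an assumption)."""
    cur_i, cur_j = local.index(current[0]), remote.index(current[1])

    def key(ij):
        i, j = ij
        both_changed = i != cur_i and j != cur_j
        return (both_changed, max(i, j), i, j)

    candidates = [
        ij
        for ij in product(range(len(local)), range(len(remote)))
        if ij != (cur_i, cur_j)
    ]
    for i, j in sorted(candidates, key=key):
        yield local[i], remote[j]


def probes_until_working(order, working):
    """Count probes sent until the first pair in `working` is hit."""
    for count, pair in enumerate(order, start=1):
        if pair in working:
            return count, pair
    return None


# Scenario 2: only A4-B4 still works.
order = probe_order(LOCAL, REMOTE, ("A1", "B1"))
print(probes_until_working(order, {("A4", "B4")}))  # (15, ('A4', 'B4'))
```

With four addresses on each side there are 16 pairs, so after A1-B1 fails there are 15 left to try, and with this ordering A4-B4 is the very last one. That is per direction of probing, before counting any retransmissions.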
Remember that while all of this is going on, the transport protocol
sees a black hole. So at any time, the transport can decide to time
out. The shim doesn't do anything that actually _hurts_ regular
transport protocols.