Re: soft state (was Re: shim6 and bit errors in data packet headers)
On 31-mei-2005, at 22:56, Erik Nordmark wrote:
This is the debate about positive vs. negative advise from the ULPs.
Hm, in a binary world saying no or not saying yes are the same
thing... Not sure if there is a big difference here.
Ah - but this is not binary.
A ULP can do three things at any given time:
- provide no advise
- provide positive advise (things are making progress)
- provide negative advise (I see some problems/retransmissions)
(BTW: advise is a verb, advice is a noun.)
Thus the lack of positive advise is not the same as negative advise.
So what we really need is: good / unknown / bad rather than either
good / unknown or bad / unknown.
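To make the good / unknown / bad distinction concrete, here is a minimal Python sketch. The names Advice and shim_action, and the reactions attached to each value, are made up for illustration and are not taken from any shim6 draft; the only point is that "unknown" gets its own case.

```python
from enum import Enum


class Advice(Enum):
    GOOD = 1      # ULP reports that things are making progress
    UNKNOWN = 0   # ULP provided no advice at all
    BAD = -1      # ULP reports problems/retransmissions


def shim_action(advice):
    # Hypothetical policy, only meant to show that the three cases differ.
    if advice is Advice.GOOD:
        return "no probe needed"
    if advice is Advice.BAD:
        return "start probing now"
    return "fall back to the shim's own timer"
```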
I want the shim to monitor ULP progress rather than depend on the
actual ULP to provide feedback. The problem with that would be
that in many cases (= 99% of the time when UDP is the payload),
the "real" ULP is implemented in the application.
Adding code in the shim that parses ULP headers to determine
"progress" doesn't make an implementation perform better, and
requiring that the shim understands this for all possible ULPs
(think raw sockets) doesn't make it easier to deploy the shim.
I imagine some optimizations such as recognizing that there is no
reply for TCP ack only packets, but let's ignore those for now.
What I propose is a mechanism that looks purely at whether traffic is
flowing in both directions. This doesn't require parsing any headers
except source and destination addresses which the shim must look at
anyway.
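A rough Python sketch of such a check, assuming the shim gets a callback for every packet sent and received and keeps one timestamp per address pair. The class name, method names and the timeout parameter are hypothetical; only source and destination addresses are inspected.

```python
class ReturnTrafficMonitor:
    """Tracks, per address pair, whether outgoing traffic is being answered."""

    def __init__(self, timeout):
        self.timeout = timeout    # seconds to wait for return traffic (assumed)
        self.waiting_since = {}   # (local, remote) -> first unanswered send time

    def on_send(self, src, dst, now):
        # Remember only the first send that has not yet seen return traffic.
        self.waiting_since.setdefault((src, dst), now)

    def on_receive(self, src, dst, now):
        # Any packet in the reverse direction counts as return traffic.
        self.waiting_since.pop((dst, src), None)

    def suspect_failure(self, local, remote, now):
        since = self.waiting_since.get((local, remote))
        return since is not None and now - since > self.timeout
```

Note that an idle pair never becomes suspect in this sketch, because nothing is waiting for an answer when nothing has been sent.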
A technique based on positive advise from the ULP is robust against
the case when the ULP doesn't provide any advise, since this would
trigger the shim to do its own data driven probes.
So we can handle the UDP case just fine in this approach. The
positive advise from the ULP is a performance optimization; when
the advise arrives in the shim it removes the need for the shim to
probe.
Yes, but it opens the door for continuous reachability probes, which
is a bad thing because it wastes bandwidth and because it is likely
to detect failures when the link is idle, which we shouldn't do IMO.
But how about this: each side tells the other side a timeout value:
after not having seen any traffic from A, B starts probing. Now one
of three situations can happen:
- regular traffic: the timer is restarted before it expires by regular
traffic, so the timer never expires and there are no probes
- irregular traffic: in order to make sure the timer doesn't expire if
there isn't any traffic for some time, the sender injects keepalives
so there are no probes
- no traffic: the sender sets the timer to a very large value or
infinity, so there are no probes
Actual reachability probes sent by the receiving end would pretty
much only happen when there is a failure, and by selecting a good
value for the timer it should be possible to suppress keepalives
pretty much all the time except when traffic volume for a session
changes. At the same time, this allows a large amount of flexibility
for accommodating application needs.
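The three cases above can be sketched in Python as follows. This is only an illustration under assumptions: the names ReceiverTimer and sender_needs_keepalive and the margin parameter are invented here, and how the timeout value is actually exchanged between the two sides is left out.

```python
import math


class ReceiverTimer:
    """B's side: probe only when A has been silent longer than A's advertised timeout."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout   # value advertised by the sender; math.inf = never probe
        self.last_seen = now

    def on_traffic(self, now):
        # Regular traffic and keepalives both restart the timer.
        self.last_seen = now

    def should_probe(self, now):
        return now - self.last_seen > self.timeout


def sender_needs_keepalive(last_sent, now, timeout, margin=1.0):
    """A's side: inject a keepalive shortly before the peer's timer would expire."""
    if math.isinf(timeout):
        return False             # "no traffic" case: the timer never expires
    return now - last_sent >= timeout - margin
```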
I think this can be implemented fairly efficiently in the shim code
path.
But things are problematic on B, because there isn't an
(efficient) strategy for the TCP on B to generate negative
advise - it doesn't run a retransmit timer.
Ah, but if A can detect the failure in this case, then that would
be good enough if A can tell B about it at some point.
Yes, but this assumes that packets from A to B in fact get delivered.
Since the failure could be for the A->B direction, the B->A
direction, or both, there are cases when B would not be informed
that A is seeing problems,
I'm assuming that when reachability probes are sent, probes with
different address pairs are sent until a working pair is found in
both directions, or it is determined that there is no bidirectional
connectivity anymore. So B would be informed unless there is no
longer any reachability possible.
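As a sketch of that exploration in Python: walk the N*M combinations of local and remote locators until one answers. The probe callable is a stand-in (an assumption, not a real API) for "send a reachability probe on this pair and report whether it worked in both directions"; ordering and timing of the probes are ignored here.

```python
from itertools import product


def find_working_pair(local_locators, remote_locators, probe):
    """Return the first (local, remote) pair for which probe() succeeds, else None."""
    for local, remote in product(local_locators, remote_locators):
        if probe(local, remote):
            return (local, remote)
    return None   # no bidirectional connectivity left on any pair
```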
I agree that we can have either or both ends try all the N*M
locator pairs, i.e. that the technique works. But the issue is how
efficient it would be and whether positive vs. negative advise
makes a difference here on what we can do in the shim.
It's hard to say for sure what's going to work best in practice, so
we need some experimentation at some point. However, that doesn't
help us now, as we need to find a good candidate or a small number of
candidates for the initial experiments.
Thus when something fails it will always be up to A to initiate
the exploration of alternate locator pairs. Also, the time at
which the exploration of alternates starts is a function of the
retransmit behavior of the ULP, which makes it harder to tightly
control the failover time.
No, in my plan the shim wouldn't know about retransmissions, it
only looks for return traffic. So either the timeout is
relatively long to accommodate ULPs that don't send traffic in
the low-traffic direction very often (I think streaming A/V
protocols send an ack every 10 seconds or so) or relatively short
but then there would be almost continuous reachability probes in
at least one direction.
But then the best you can do is determined by the ULP's
(re)transmission behavior, which is why I say that you can't control
the failover time in the shim.
In this scenario the failover time depends on the assumptions about
the maximum delay between traffic in one direction and traffic (acks)
in the other direction. I think this would be some 10 seconds; adding
a few extra seconds for network delays and random variations, that
would make for around 15 seconds. Compared to what we have now, worst
case or average case, that's pretty good.
If the ULP retransmits 10 times with binary exponential backoff
starting with a timeout of 4 seconds, and it has been told to send
negative advise after consuming half the retransmits, then the shim
will see negative advise after 4+8+16+32+64 seconds.
(Note that this doesn't apply to what I was talking about as the ULP
itself isn't involved.)
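For reference, the arithmetic in the quoted retransmission example can be checked with a few lines of Python. The function name and the fraction parameter are just illustrative; the numbers (10 retransmits, 4 second initial timeout, advise after half of them) are the ones from the quoted text.

```python
def time_until_negative_advice(initial_timeout, retransmits, fraction=0.5):
    """Seconds spent in binary exponential backoff before negative advise is sent."""
    consumed = int(retransmits * fraction)   # retransmits used up before giving advise
    return sum(initial_timeout * 2 ** i for i in range(consumed))


# 4 + 8 + 16 + 32 + 64
print(time_until_negative_advice(4, 10))   # prints 124
```

So in that scenario the shim would only hear about the problem after a little over two minutes.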
What kind of failover time do you imagine, BTW?
10 seconds for sending a probe might not be a bad default.
I agree 10 secs is a good starting point but I'm afraid using 10
seconds will clash with transports that use 10 seconds themselves...
We probably need to review a bunch of ULPs to make a good decision here.
If we think we can send a small number of probes in parallel (3 or
so) with binary exponential backoff for the probes, we might be
able to recover from the failure in one RTT after those 10 seconds,
but if there are lots of address pairs to try and more than one has
failed, it can take a lot longer. If the shim itself is
conservative and has a 4 second probe timeout with exponential
backoff, then it would
- send 3 probes at time 10 seconds
- send 3 other probes at time 14 seconds
- send 3 more at time 22 seconds
Why send 3 at the same time and then wait? Even with a small packet
train of 3 packets we're unnecessarily bursty. I think sending one at
a time would be better, for instance:
- at 10 seconds
- at 11 seconds
- at 12 seconds
- at 14 seconds
- at 16 seconds
- at 19 seconds
- at 22 seconds
- at 25 seconds
- at 29 seconds
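To compare the two schedules side by side, here is a small Python sketch. The burst schedule is generated from the quoted parameters (first probes at 10 seconds, 4 second timeout, doubling each round); the paced schedule is simply the list of times given above. Function names are made up for illustration.

```python
def burst_schedule(start, timeout, rounds, burst):
    """Send `burst` probes at once, then wait with binary exponential backoff."""
    times = []
    t = start
    for _ in range(rounds):
        times.extend([t] * burst)
        t += timeout
        timeout *= 2
    return times


BURST = burst_schedule(start=10, timeout=4, rounds=3, burst=3)
PACED = [10, 11, 12, 14, 16, 19, 22, 25, 29]   # one probe at a time, as listed above


def sent_by(schedule, t):
    """Number of probes sent at or before time t."""
    return sum(1 for s in schedule if s <= t)


def largest_burst(schedule):
    """Largest number of probes sent at the same instant."""
    return max(schedule.count(s) for s in set(schedule))
```

Both schedules have sent 3 probes by 12 seconds and 9 probes in total; the paced one finishes at 29 seconds instead of 22 but never sends more than one packet at a time.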
With such a strategy the shim implementation can do a check
after sending a ULP packet: "how long ago was the last
positive advise?"
After every packet...?
Yes, but a node which implements the mandatory NUD in RFC 2461
already has that test in the code path, so it might very well be
possible to implement the shim6 liveness check without adding an
extra test to the code path.
Hm, ok. But we have to be careful about mandating continuous
communication between layers. In a properly layered implementation,
this type of communication can be quite expensive (context switches
and so on).