Re: soft state (was Re: shim6 and bit errors in data packet headers)
marcelo bagnulo braun wrote:
Why would we want to couple the state management aspects of shim6 with
the shim6 test protocol? To me any such coupling seems undesirable,
especially since the parameters for the test protocol (how quickly to
detect failures) might be a function of upper layer advice, as well as
upper layer hints of "working" or "not working".
Well, I guess that the situation where one of the nodes has lost the shim
state can be seen as a form of failure, and my assumption is that the failure
detection mechanisms will likely detect it first.
But that's a circular argument for including the context state in the
failure detection mechanism. You are in effect saying that the test
protocol should test whether the context has been lost on the peer since
it can be made to test for a lost context on the peer.
FWIW the outline of a test protocol in section 5.4 of
draft-arkko-multi6dt-failure-detection-00.txt doesn't assume such a
thing. (But it does assume that B remembers something about previously
received probes, so there are some issues about DoS opportunities.)
I think that the protocol behaviour would be something like this.
A communication is established between node A and node B
Later on, a shim context is created between those two nodes.
The parameters for that context are:
ULIDs: IPA1 and IPB1
Locators: for IPA1 (IPA1,...,IPAn)
for IPB1 (IPB1,...,IPBm)
And a context tag presumably.
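
The context parameters listed above could be sketched as a small record. This is only an illustration: the field names, the documentation-prefix addresses and the integer tag value are assumptions, not taken from any shim6 draft.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShimContext:
    """Per-peer shim state as seen by node A (hypothetical field names)."""
    local_ulid: str             # IPA1
    peer_ulid: str              # IPB1
    local_locators: List[str]   # IPA1, ..., IPAn
    peer_locators: List[str]    # IPB1, ..., IPBm
    context_tag: int            # assumed here to be a plain integer


# Example addresses use the 2001:db8::/32 documentation prefix.
ctx = ShimContext(
    local_ulid="2001:db8:a::1",
    peer_ulid="2001:db8:b::1",
    local_locators=["2001:db8:a::1", "2001:db8:a2::1"],
    peer_locators=["2001:db8:b::1", "2001:db8:b2::1"],
    context_tag=0x12345,
)
```

Note that in this sketch the ULID of each side is also the first entry in its own locator list, as in the parameters above.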
Suppose that for some reason node B loses the shim context (and only
the shim context, i.e. the application and transport state about ongoing
communications is preserved).
I guess that at this point we have several scenarios to consider:
Scenario a): the communication between A and B is still using IPA1 and
IPB1 as locators.
This scenario has two subcases:
Scenario a.1) The communication is bidirectional and e.g.
TCP is providing acks of the progress of the communication.
This means that no periodic reachability tests
nor any other shim signaling are being exchanged.
In this scenario, a loss of the shim context would remain
undetected until there is a failure and node A detects it
and tries to explore alternative paths. This is because
data packets will carry ULIDs and will be passed successfully
to the upper layers.
If we assume that B (as well as A) will have a heuristic to create shim6
contexts (e.g. based on having received 50 packets for a locator pair),
then this heuristic might be triggered and cause B to try to establish a
context with A, at which point in time A will see that it already has a
context with B.
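
A minimal sketch of such a heuristic, assuming a simple per-locator-pair packet counter. The class and method names are hypothetical, and the threshold of 50 is just the example value used above.

```python
from collections import Counter

PACKET_THRESHOLD = 50  # example value from the text, not a recommended setting


class ContextHeuristic:
    """Decide when a node should try to establish a shim6 context."""

    def __init__(self, threshold=PACKET_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()   # packets seen per (src, dst) locator pair
        self.attempted = set()    # pairs for which establishment was triggered

    def on_packet(self, src, dst):
        """Return True exactly when this packet triggers context establishment."""
        pair = (src, dst)
        if pair in self.attempted:
            return False
        self.counts[pair] += 1
        if self.counts[pair] >= self.threshold:
            self.attempted.add(pair)
            del self.counts[pair]
            return True
        return False

    def lose_state(self):
        """Model B losing all of its shim state, including the counters."""
        self.counts.clear()
        self.attempted.clear()
```

After lose_state(), B starts counting again, so with ongoing traffic it would eventually try to establish a new context, and A would notice that it already holds one for B.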
Once there is a failure,
reachability test packets won't be recognized as belonging
to any existing shim context and the problem can be detected.
Here you are already assuming that reachability test packets will not be
recognized, i.e. presupposing a particular interaction between the state
management and the test protocol.
Scenario a.2) The communication is unidirectional
In this case, periodic reachability tests need to be
performed in order to verify that the path is still working.
If node B loses its shim state, it won't recognize
the reachability test packets, and the loss of context can
be detected.
Again, here you are presupposing a particular interaction.
Scenario b) the communication between A and B is using alternative
locators.
In this case, when node B loses the context, data packets won't be
properly delivered at node B, because they won't be properly demuxed.
At this point, the reachability test will be performed to verify the
locator pair being used.
If you are using alternate locators and the working locator pair is
unidirectional, then it seems like you'd need to be able to re-discover
that working unidirectional locator pair, before you can re-establish
the context state on B.
Thus if A is sending using IPA1->IPB2 and B was replying using
IPB1->IPA2, and B loses the context state, what do you do?
Seems like solving this case requires that the test protocol is not tied
in with the state management.
I don't know if I am missing something, but AFAICS, all the situations
where the shim context is lost result in a reachability test exchange,
and that is why I was wondering if it wouldn't make sense to define a
"no-context" error message as a reply to a reachability test request packet.
That is one particular solution with strong coupling between the test
protocol and the state management.
But don't we want to retain the possibility to test locator pairs for
initial contact, i.e. before a context is established between the peers?
And handle the above case of unidirectional locator pairs?
But I fail to understand how the node that has lost the state can
identify that a data packet belongs to a non-existent shim state...
By seeing that the <source locator, destination locator, context tag>
doesn't match any existing context?
I suspect we want that capability for robustness in any case.
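
As a rough sketch of that check, assuming the packet carries the context tag explicitly and that contexts are indexed by the triple above (the table class and the string stored per context are hypothetical):

```python
from typing import Dict, Optional, Tuple

# (source locator, destination locator, context tag)
ContextKey = Tuple[str, str, int]


class ContextTable:
    """Receiver-side lookup of shim contexts by locator pair and tag."""

    def __init__(self) -> None:
        self._contexts: Dict[ContextKey, str] = {}

    def add(self, src_loc: str, dst_loc: str, tag: int, ulid_pair: str) -> None:
        self._contexts[(src_loc, dst_loc, tag)] = ulid_pair

    def lookup(self, src_loc: str, dst_loc: str, tag: int) -> Optional[str]:
        """Return the ULID pair for a matching context, or None if there is none."""
        return self._contexts.get((src_loc, dst_loc, tag))

    def clear(self) -> None:
        """Model the node losing all of its shim contexts."""
        self._contexts.clear()


table = ContextTable()
table.add("IPA2", "IPB2", 0x2A, "IPA1<->IPB1")
```

A None result from lookup() is the "doesn't match any existing context" case. What the receiver does next (drop silently, or send an error) is a separate decision that this sketch leaves open.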
I mean, I guess that a first element that is relevant here is where
we are going to carry the context tag.
If the context tag is carried in an extension header or destination option, then
I can see that a node that receives a packet with one of those can
easily detect that there is no context associated. (Note that in this
case, the context loss is only detected when the locators
used for the communication differ from the ULIDs, i.e. when the extension
header or destination option is included in the packet.)
If the context tag is included in the flow label, then I don't see how a
node that receives the data packet can determine that the packet is
associated with a shim context that is no longer there. At this point, I
guess that, as you mentioned in a previous mail, the data packet would be
silently discarded, right?
If the context tag is carried as a flow label, I still think we need a
way to tell the receiver "this is a shim6 packet". For robustness
reasons I think the fact that the packet needs shim6 processing should
be explicit.
There have been proposals in multi6 which suggested doing this without
making the packets larger by defining a set of new nexthdr values with
meanings like
shim6+tcp
shim6+udp
...
shim6+esp
Not having that "shim6" bit when the flow label is used as a context tag
can easily result in hard to diagnose errors. We might have errors due
to some middlebox messing with the data packets (a TCP relay for
instance), but that leaves the shim6 test packets alone. If the TCP
relay doesn't preserve the flow label, then the packets would be dropped
due to TCP checksum errors (since the ULID rewrite didn't happen), but
the test protocol would say that everything is fine.
I think that at this point it is clear to me that if we define a no-context
error message, this message should be defined as a reply to a packet
that refers to that context, and it should include enough information
about this initial packet to verify that it is a reply to that packet.
The no-context error message cannot be issued spontaneously by a node.
Agreed.
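
One way to picture that rule, as a sketch only: the triggering packet carries a random nonce, the error echoes it together with the context tag, and the original sender accepts the error only if it matches a packet it actually sent. The nonce field and all names here are assumptions for illustration, not part of any draft.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    src: str
    dst: str
    context_tag: int
    nonce: int


@dataclass(frozen=True)
class NoContextError:
    src: str
    dst: str
    context_tag: int   # echoed from the triggering packet
    nonce: int         # echoed from the triggering packet


class Prober:
    """Sender side: remembers outstanding probes so replies can be verified."""

    def __init__(self):
        self.outstanding = {}

    def send_probe(self, src, dst, context_tag):
        probe = Probe(src, dst, context_tag, secrets.randbits(64))
        self.outstanding[probe.nonce] = probe
        return probe

    def accept_error(self, err):
        """Accept a no-context error only if it answers an outstanding probe."""
        probe = self.outstanding.get(err.nonce)
        if probe is None:
            return False
        expected = (probe.dst, probe.src, probe.context_tag)
        if (err.src, err.dst, err.context_tag) != expected:
            return False
        del self.outstanding[err.nonce]
        return True


def reply_no_context(probe):
    """Receiver side: built only in response to a received packet."""
    return NoContextError(probe.dst, probe.src, probe.context_tag, probe.nonce)
```

Because the receiver has no constructor for the error other than reply_no_context(), it cannot issue the message spontaneously, and a replayed or forged error fails the check in accept_error().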
Erik