Dear Mr. Farrel,

Thank you for the response to the Q14/15 liaison about the CCAMP
crankback draft. We appreciate the opportunity to provide further
input to the work. Q14/15 will address the LS at its upcoming meeting.

Regards,
Kam Lam, Q14/15 Rapporteur
To: Mr. Kam Lam, Rapporteur Q14/15
From: Adrian Farrel and Kireeti Kompella, IETF CCAMP co-chairs
Cc: Alex Zinin and Bill Fenner, IETF Routing Area Directors
    Scott Bradner, IETF liaison to ITU-T
Subject: Crankback in GMPLS Systems
For: Information
Dear Kam,
Thank you for your liaison
concerning draft-ietf-ccamp-crankback-03. It is useful to have additional
review input from a wide audience. Please convey our special thanks to
Stephen Shew and Marco Carugi for their detailed review of the draft in
Geneva.
We would like to urge Q14/15 to continue to consider this draft
as further work is carried out on crankback within the context of
G.7713.
In response to the specific points that were raised in the
liaison...
> 1. Semantics of the term "node". Due to the GMPLS principle of
> maintaining separation of control and transport (data/bearer) planes,
> there are two meanings for the term "node". First, an instance of a
> signalling protocol (and/or routing protocol) that has some transport
> resources in its scope. Second, a transport plane resource such as a
> cross connect. Using the first meaning, a node is not the context for
> the interface identifiers that are passed in crankback TLVs.
> Throughout the document the particular meaning can be determined
> by the context of the term. Examples are:
>
> - Section 5.2, the sentence "Otherwise, multiple nodes might attempt
>   to repair the LSP." means the control functions of signalling and
>   routing.
>
> - Section 7.1 "As described above, full crankback information SHOULD
>   indicate the node, link and other resources, which have been
>   attempted." refers to the transport resource.
It is correct to observe that historically there has been
poor separation of controllers and transport devices within GMPLS, with
much of this issue arising from the historic collocation of controllers and
data switches in MPLS networks. This persists because of the (eminently
sensible) tendency to optimize for the majority case.
However, in
the case of crankback, and specifically in the case of this draft, the
emphasis in providing 'full crankback information' is on the addresses of
transport links and nodes and not controllers. We will revisit the draft to
ensure that where control plane function is implied, the "node" that takes
action is clearly identified as the control plane node.
> There are some occasions where the use of the term appear to be
> ambiguous and clarity would be appreciated. In particular TLV
> types 10 and 32. If type 10 represents a routing and signalling
> function, then what TLV describes the "transport plane node"
> (e.g., cross connect or Network Element)? If type 32 means
> "transport plane nodes", then a different TLV may be needed
> to identify the "routing/signalling nodes" that have already
> participated in crankback attempts.
>
> Having a clearer distinction between control plane functions
> and transport plane resources would be helpful.
As
indicated above, the intention of crankback is to apply a process to the
path determination for an LSP. The path is determined using transport plane
links and nodes, and although there may be some interesting aggregation
available by converting this information to control plane nodes, the
conversion is not necessarily simple. Thus, these TLVs all refer to
transport plane quantities, and we will make this clearer in
the draft.
Again, of course, in the majority case we can make
considerable optimizations by knowing that control plane and transport
plane "nodes" are related in a 1:1 ratio and are usually
collocated.
> 2. When crankback information is received at a "routing/signalling
> node", can it be used by the routing path computation function for
> other LSP requests than the LSP whose signalling caused the
> crankback action?
It is generally
out-of-scope for the IETF to dictate how individual implementations
operate. It is quite conceivable that such an action would be taken, but it
is also clear that there is a potentially dangerous interaction with the TE
flooding process (i.e. the IGP). Thus we would say that the crankback
information MAY be used to inform other path computations.
We would also caution that crankback is not intended to supplement or
replace the normal operation of the TE flooding mechanism provided by
the TE extensions to the IGP, except during the establishment of a
single LSP. If the IGP is found to be deficient as a flooding
mechanism, we would expect to look first at ways to address the
problems through IGP extensions before utilizing a signaling
mechanism.
We will look at how to add some of
this information to the draft.
> 3. Section 6.1 "Segment-based Re-routing" option. It is not clear
> what this means. Can multiple "routing/signalling nodes" perform
> crankback on the same LSP at the same time if this flag is set?
Since the intention is to establish only
one LSP, there must be only one active sequence of LSP setup messages
(RSVP-TE Path messages) at any time. Thus only one LSR may attempt
re-routing at any one time.
If you consider the processes by which Path messages are attempted and
crankback information is returned on PathErr messages, this will be
clear. That is, when an LSR receives a crankback PathErr, it may
attempt to re-route or it may forward the PathErr back upstream.
It might help if we reworded the draft to say "Any node may
attempt rerouting after it receives an error report and before it passes
the error report further upstream."
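As an illustration only (the function and resource names below are
hypothetical, not taken from the draft), the per-LSR rule "re-route if
you can, otherwise forward the error report upstream" can be sketched
as:

```python
def handle_crankback(history, failed, candidate_paths, retries_left):
    """Sketch of an LSR's reaction to a crankback PathErr.

    history: set of resources already known to have failed for this LSP.
    failed: the resource reported in the incoming crankback PathErr.
    candidate_paths: alternate downstream paths, each a list of resources.
    Returns ("reroute", path) or ("forward_upstream", None).
    """
    history.add(failed)  # record the newly reported failure
    if retries_left > 0:
        for path in candidate_paths:
            # Only one active setup attempt exists at a time, so the LSR
            # simply picks the first path avoiding all known-bad resources.
            if not set(path) & history:
                return ("reroute", path)
    # No local repair possible: pass the error report further upstream.
    return ("forward_upstream", None)
```

Note the single-threaded structure: the LSR either issues one new Path
message or propagates the PathErr, never both, which is why only one
LSR may attempt re-routing at any one time.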
> 4. Section 4.3 History persistence. If a repair point (a
> "routing/signalling node") is unsuccessful in a crankback attempt,
> is it possible for it to be not involved when another repair point
> (e.g., closer to the source) succeeds in a crankback attempt. If so,
> how does the first repair point know to clear its history?
Note that the purpose of the history table as
described in section 4.3 is to correlate information when repeated retry
attempts are made by the same LSR. Suppose an attempt is made to route from
A through B, and the signalling controller for B returns a failure with
crankback information. An attempt may be made to route from A through C,
and this may also fail with the return of crankback information. The next
attempt SHOULD NOT be to route from A through B, and this is achieved by
use of the history table.
The history table can be discarded by the
signaling controller for A if the LSP is successfully established through
A. The history table MAY be retained after the signaling controller for A
sends an error upstream, however it is questionable what value this
provides since a future retry as a result of crankback rerouting should not
attempt to route through A (such is the nature of crankback). If the
history information is retained for a longer period it SHOULD be discarded
after a local timeout has expired, and that timer MUST be shorter than the
timer used by the ingress to re-attempt a failed service (note that
re-attempting a failed service is not the same as making a re-route attempt
after failure).
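The A-through-B, A-through-C example above can be summarized in a small
sketch of the history table kept by the signalling controller for A.
This is an illustration under our own naming, not the draft's normative
data structure:

```python
class CrankbackHistory:
    """Per-LSP history table at one repair point (here, node A).

    Its only job is to correlate repeated retry attempts made by the
    same LSR, so the same failed next hop is not tried twice.
    """

    def __init__(self):
        self.failed = set()

    def record_failure(self, next_hop):
        # Crankback information returned on a PathErr names the failure.
        self.failed.add(next_hop)

    def next_attempt(self, candidates):
        """First candidate next hop not already tried; None when exhausted
        (at which point A sends the error upstream)."""
        for hop in candidates:
            if hop not in self.failed:
                return hop
        return None

    def discard(self):
        """Invoked when the LSP is established through A, or on a local
        timeout shorter than the ingress's service-retry timer."""
        self.failed.clear()
```

The timer relationship matters: discarding on a timeout longer than the
ingress retry timer could let stale history suppress a next-hop that a
fresh service attempt is entitled to try.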
As mentioned for point 2, the crankback information MAY
be used to enhance future routing attempts for any LSP, but this is not
what section 4.3 is describing.
We will try to clarify this in the
draft.
> 5. Section 4.5 Retries. Some guidance on setting the number of
> retries may be helpful as this is a distributed parameter. Is it set
> to be the same value at all points that can perform crankback within
> one network?
The view of CCAMP at the moment is that although it is
technically possible to allow the number of retries to be set for each LSP,
this probably represents too much configuration and too fine a level
of control. It seems likely that initial deployments will wish to set
the number of retries per node through a network-wide configuration
constant (that is, all LSRs capable of retrying will apply the same count)
with the possibility of configuring specific LSRs to have greater or lower
counts. Note that configuring an LSR not to be able to perform retries
is equivalent to configuring the retry count to be zero for that
LSR.
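The configuration model described above (a network-wide constant with
optional per-LSR overrides, where "cannot retry" is simply a count of
zero) amounts to a lookup of this shape; all names and values here are
illustrative, not recommendations:

```python
# Network-wide retry count applied by every LSR capable of retrying.
NETWORK_RETRY_COUNT = 3

# Optional per-LSR overrides; an override of 0 disables retries at
# that LSR. The LSR names and values are purely hypothetical.
PER_LSR_OVERRIDE = {"boundary-1": 5, "core-7": 0}

def retry_limit(lsr_name):
    """Retry count in effect at a given LSR: the override if one is
    configured, otherwise the network-wide constant."""
    return PER_LSR_OVERRIDE.get(lsr_name, NETWORK_RETRY_COUNT)
```

As the letter notes, the actual values are a deployment matter,
constrained by the topology and nature of the network.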
It is also probable that initial deployments will significantly
restrict the number of LSRs within the network that can perform
crankback rerouting. This would probably be limited to "boundary"
nodes.
In the event that implementations and deployments wish to
control the number of retries on a per LSP basis, we would revisit the
signaling specification and add the relevant information to the Path and
PathErr messages.
The actual value to set for a retry threshold is
entirely a deployment issue. It will be constrained by the topology and
nature of the network. It would be inappropriate to suggest a figure in
this draft since there are no hard and fast rules.
In review of
section 4.5 of the draft, we see that there is some old text describing
more flexibility in the control of retries than we intend to provide. Thank
you for drawing our attention to this; we will clean it up.
Thank
you once again for your feedback on this draft. If you have further
comments, we would certainly like to hear them. The easiest way for
individuals to contribute to the discussion of this topic is by sending
mail to the CCAMP mailing list. Details of how to subscribe to this list
can be found at http://www.ietf.org/html.charters/ccamp-charter.html

Yours sincerely,
Adrian Farrel and Kireeti Kompella