RE: Comparison of restoration requirements between transport and packet networks

Hi Neil,

Please see comments inline

Thanks,

Richard.

-----Original Message-----
From: owner-ccamp@ops.ietf.org [mailto:owner-ccamp@ops.ietf.org] On Behalf Of neil.2.harrison@bt.com
Sent: Thursday, September 11, 2003 5:06 AM
To: rabbat@fla.fujitsu.com; ccamp@ops.ietf.org
Subject: RE: Comparison of restoration requirements between transport and packet networks

Richard,

In section 1 of your paper it says:

"This document presents a fault notification protocol that is both technology and topology agnostic, and applies to intra-domain protection. "

[Richard] What the draft meant is that we do not described the technology implementation of the flooding method described in it. Rather we keep the implementation separate. I’ll change to sentence to clear the misunderstanding

That being the case, wherever it is to be used one needs all the defects defining. Defects should be detected in the data-plane (and not by control-plane proxy) at the trail termination point using the OAM functions appropriate to the mode/technology, ie cnls is different to co-ps is different to co-cs. If one wants to go 'fast' (and I seriously question the sanity of those seeking to beat 50ms in SDH at higher layer networks) then only certain defects and technologies are relevant. Further, one should take care not to invoke protection/restoration for error events which self-clear.

[Richard] I wholeheartedly agree. 50 ms is most probably not doable in shared mesh networks. It may be doable in simpler configurations. The whole point of the draft is to *guarantee* a notification time. With all the layers doing some kind of protection and restoration at different time granularities, escalation some layers need to wait for lower layers to recover from a fault/defect before starting their own process. How does one define the time if there is no time guarantee? Should we assign a random value and hope for the best or should there be a time after which one is assured that the other layer did not accomplish its task and engage its own recovery mechanism. Assign 1 second or 200 ms or any time as being a hard bound, but make it hard.

The defects that we thought about when we wrote this draft are:

- Fiber cut

- Transponder failure

- Node failure

I hope this clears up the misunderstanding.

Some technologies are poorly specified wrt defects. I therefore don't believe your proposals are truly mode/technology generic as claimed. That was really the only point I was trying to make.

But here are a few further remarks on you paper if you are interested.........

I personally don't like the sound of the proposals as its a complexity that I don't think is necessary.

[Richard] Thanks for the interesting comments and a very good presentation of the problems at hand. I believe we are looking at a slightly different problem space within the context of protection/restoration. We’re going to look at these carefully and get back to you on these points you make.

{Aside - One of the major problems we operators face is the cost of complexity. Quite a lot of the stuff I am seeing are complexities aimed at BW squeezing.......which IMO have only a 2nd/3rd order potential benefit at the expense of 1st order complexity capex/opex costs. Well, BW may have been the right metric to conserve 10+ years ago but its not true today (in most cases). I am not pointing the finger specifically at your work here, but things like 'faster, faster, faster' restoration in all layers, having lots of QoS/traffic classes *per* network mode, and stuff like multiple class pre-emption/bumping are all examples of complexities whose costs outweigh their benefits IMO. We should use BW wisely to reduce complexity. This is a major focus point of the future network architecture views we are generating in BT, and at this point I don't really want to go any further on this issue on the lists.}

[Richard] w/r to this, we’ll describe in more detail the network model at hand.

A trail termination point is the only place defects can be detected in co-ps/cs modes. So the entity that you describe as 'per failure' vs 'per LSP' in section 4 is not strictly correct. I think what you really mean is the single failure of a trail in some server layer network generating multiple failures in all the client layer network trails it supports. This is a *recursive* behaviour.......and if you drew out the G.805 functional architecture I think you would quickly realise this and, in particular, that its the optical trail where your focus seems to lie (which I can understand), which creates a link-connection in the immediately above layer network).

Given a trail termination point is the only place defects can be detected, it is vital that one consequent action of defect detection at this point is the generation of a Forward Defect Indication (FDI)......sometime known as AIS. The whole purpose of this signal is to tell all the higher layer client networks (at *their* trail termination points) not to raise alarms....else this will casue major problems/opex-costs chasing faults in the wrong layer/place (this can be across different operators in different countries so its a pretty serious issue). This itself is a 'flooding' behaviour but it is constrained flooding only to the affected clients.....and not the rest of the server (or client) layer network(s) who don't really need to know about this.

In the case of a fibre-cut then FDI would go in both directions. By definition this must be the fastest form of signalling to inform the nodes at either end of the affected trail(s). In the case of uni-directional failures then FDI would go forwards and one would have to either use the BDI or some dedicated backwards signalling to inform the head end.

So given we *must* have FDI for client layer alarm suppression then I am not at all clear what benefits your proposal is giving that targeted fault notification will not achieve just as well.

BTW - we have done extensive testing of restoration schemes using signalling with crankback and simple routing (plus route pruning) processes.....and it works great.

regards, Neil

-----Original Message-----
From: Richard Rabbat [mailto:rabbat@fla.fujitsu.com]
Sent: 10 September 2003 23:12
To: ccamp@ops.ietf.org
Subject: re: Comparison of restoration requirements between transport and packet networks

Hi Neil,

Thanks for the email. I’d like to ask you to expand a bit on your idea on the mailing list. I’m afraid I’m a bit confused about the question and need more information.

Do you mean by your question:

- How do we figure out what the defect is and where it happened? Is it a fiber cut? Is it a misconfiguration? How do you do fault localization

Or:

- What set of defects does your solution address?

Or maybe something else.

Thanks,

Richard.

-----Original Message-----
From: neil.2.harrison@bt.com [mailto:neil.2.harrison@bt.com]
Sent: Wednesday, September 10, 2003 1:13 PM
To: rabbat@fla.fujitsu.com; ccamp@ops.ietf.org; zali@cisco.com
Cc: v.sharma@ieee.org
Subject: RE: Comparison of restoration requirements between transport and packet networks

Where are the defects specified (take any networking mode/layer you like) wrt entry/exit criteria and consequent actions?

regards, Neil

-----Original Message-----
From: Richard Rabbat [mailto:rabbat@fla.fujitsu.com]
Sent: 10 September 2003 20:11
To: ccamp@ops.ietf.org; Zafar Ali
Cc: Vishal Sharma; 'Richard Rabbat'
Subject: Comparison of restoration requirements between transport and packet networks

Hi Zafar, CCAMP,

During discussions with several colleagues within the CCAMP WG, it has become clear that it would be useful to clarify some of the fundamental differences between restoration in packet networks and that in transport networks.

This is because this difference, together with the time criticality of restoration at the transport layer, requires the development of techniques for time-bounded notification. It would then be useful to discuss the solutions proposed in draft-rabbat-fault-notification-protocol-03 for such notification.

We are in the process of preparing a contribution on this subject, but thought it would be useful to highlight a few key points on the mailing list, so that we can elicit feedback and comments from the WG.

In normal packet networks (MPLS networks) one can pre-signal *and* pre-configure a backup LSP for a working LSP. This is because selecting a label at a node for a backup LSP is sufficient to be able to switch traffic for that LSP when that traffic arrives. If resources are required for the backup LSP (buffers and bandwidth), they too can be reserved in advance (during the LSP signaling phase), but can still be used by low-priority or extra-traffic LSPs as long as there is no failure on the working LSP.

This is true even for shared mesh restoration in MPLS networks. In that case, multiple labels would be assigned, one for each of the backup LSPs (corresponding to link and/or node disjoint working LSPs) transiting a node on the shared backup path, but only one set of resources (buffers, bandwidth) would be reserved (if such resource reservation was needed).

In transport networks, however, one can pre-signal but not pre-configure a backup LSP (unless one was doing just 1+1 protection). This is because, in transport networks, if an LSP is established (that is, it is cross-connected) then the full bandwidth of the LSP is automatically *consumed*, irrespective of whether traffic actually flows on this LSP.

For this reason, to implement shared restoration schemes in transport networks (and allow extra-traffic) a backup LSP cannot be cross-connected until *after* the specific failure for which this backup LSP was pre-signaled has occured.

Now, if signaling-based notification is used in transport networks, an *additional phase of signaling* is required along the backup path to enable nodes along that path to reconfigure themselves (this is well-described in the functional specification document

of the P&R Design Team). This lengthens the time to recover from the failure. Depending on the layer at which recovery is being performed this may or may not be acceptable.

In the specific case of transport networks, restoration is typically a time-critical activity, so this round-trip signaling delay could be unacceptable when time-bounded notification and recovery is desired.

In addition, signaling individual LSPs or individual LSP bundles may create buffering problems that makes signaling time unbounded.

If instead, the information about a failure is flooded to all the network nodes, and the backup paths are selected intelligently (as described in draft-rabbat-fault-notification-protocol-03.txt), this additional signaling hand-shake delay can be eliminated. This is because by flooding the information about a fault on a working LSP, one can inform, in parallel, all the nodes lying along the path of the backup LSP. Thus, the repair point(s) upon learning of the fault holds off activating the backup LSP(s) for an appropriate time in which all nodes along the corresponding backup path(s) will have reconfigured themselves.

We would also like to get feedback on a suitable protocol that could implement time-critical flooding notification.

Comments, thoughts and questions are welcome!

--

Richard Rabbat, Ph.D.

Member of Research Staff, Fujitsu Labs of America

1240 E Arques Ave, MS 345, Sunnyvale, CA 94085

Phone: 408-530-4537. Fax: 408-530-4581. Cell: 650-714-7618