Hi Neil,

Please see comments inline.

Thanks,
Richard.
-----Original Message-----
Richard,
In section 1 of your paper it says:
"This document presents a fault notification protocol that is both technology and topology agnostic, and applies to intra-domain protection. "
[Richard] What the draft means is that we do not describe a technology-specific implementation of the flooding method; rather, we keep the implementation separate. I'll change the sentence to clear up the misunderstanding.
That being the case, wherever it is to be used one needs all the defects defined. Defects should be detected in the data plane (and not by control-plane proxy) at the trail termination point using the OAM functions appropriate to the mode/technology, i.e. cnls is different to co-ps is different to co-cs. If one wants to go 'fast' (and I seriously question the sanity of those seeking to beat 50ms in SDH at higher layer networks) then only certain defects and technologies are relevant. Further, one should take care not to invoke protection/restoration for error events which self-clear.

[Richard] I wholeheartedly agree. 50 ms is most probably not doable in shared mesh networks; it may be doable in simpler configurations. The whole point of the draft is to *guarantee* a notification time. With all the layers doing some kind of protection and restoration at different time granularities, escalation means some layers need to wait for lower layers to recover from a fault/defect before starting their own process. How does one define that waiting time if there is no time guarantee? Should we assign a random value and hope for the best, or should there be a time after which one is assured that the other layer did not accomplish its task, so this layer can engage its own recovery mechanism? Assign 1 second or 200 ms or any time as the bound, but make it a hard bound. The defects that we thought about when we wrote this draft are:
- Fiber cut
- Transponder failure
- Node failure
I hope this clears up the misunderstanding.
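A minimal sketch of this hold-off idea in Python (HoldOffTimer and all names here are illustrative, not from the draft): the client layer arms a hard-bounded timer on fault notification and escalates only if the lower layer has not cleared the fault in time.

import threading

class HoldOffTimer:
    # On a fault notification, wait a hard-bounded interval for the lower
    # layer to recover before engaging this layer's own recovery mechanism.
    def __init__(self, hold_off_seconds, start_recovery):
        self.hold_off_seconds = hold_off_seconds  # the hard bound, e.g. 0.2 or 1.0
        self.start_recovery = start_recovery      # this layer's recovery action
        self._timer = None

    def on_fault_notified(self, fault_id):
        # The lower layer gets first chance to recover within the bound.
        self._timer = threading.Timer(self.hold_off_seconds,
                                      self.start_recovery, args=(fault_id,))
        self._timer.start()

    def on_fault_cleared(self, fault_id):
        # The lower layer recovered in time: cancel the escalation.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None

The point of a guaranteed notification time is that hold_off_seconds can then be a provable bound rather than a guess.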
Some technologies are poorly specified wrt defects. I therefore don't believe your proposals are truly mode/technology generic as claimed. That was really the only point I was trying to make.
But here are a few further remarks on your paper if you are interested.....
I personally don't like the sound of the proposals, as it's a complexity that I don't think is necessary.
[Richard] Thanks for the interesting comments and a very good presentation of the problems at hand. I believe we are looking at a slightly different problem space within the context of protection/restoration. We're going to look at these carefully and get back to you on the points you make.
{Aside - One of the major problems we operators face is the cost of complexity. Quite a lot of the stuff I am seeing is complexity aimed at BW squeezing.......which IMO has only a 2nd/3rd-order potential benefit at the expense of 1st-order complexity capex/opex costs. BW may have been the right metric to conserve 10+ years ago, but that's not true today (in most cases). I am not pointing the finger specifically at your work here, but things like 'faster, faster, faster' restoration in all layers, having lots of QoS/traffic classes *per* network mode, and stuff like multiple-class pre-emption/bumping are all examples of complexities whose costs outweigh their benefits IMO. We should use BW wisely to reduce complexity. This is a major focus point of the future network architecture views we are generating in BT, and at this point I don't really want to go any further on this issue on the lists.}
[Richard] With respect to this, we'll describe the network model at hand in more detail.
A trail termination point is the only place defects can be detected in co-ps/co-cs modes. So the entity that you describe as 'per failure' vs 'per LSP' in section 4 is not strictly correct. I think what you really mean is the single failure of a trail in some server layer network generating multiple failures in all the client layer trails it supports. This is a *recursive* behaviour.......and if you drew out the G.805 functional architecture I think you would quickly realise this and, in particular, that it's the optical trail where your focus seems to lie (which I can understand), which creates a link-connection in the immediately above layer network.
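A minimal sketch of that recursive fan-out in Python (the Trail type is illustrative, not from G.805): one server-layer trail failure yields a failure in every client-layer trail it supports, and in their clients in turn.

class Trail:
    def __init__(self, name):
        self.name = name
        self.clients = []  # client-layer trails supported by this trail

def affected_trails(failed):
    # A single server-layer failure fans out recursively to every
    # client-layer trail riding the failed trail.
    result = [failed]
    for client in failed.clients:
        result.extend(affected_trails(client))
    return result

So a single optical-trail failure is one event at the server layer but many affected entities above it, which is the 'per failure' vs 'per LSP' distinction restated in G.805 terms.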
Given a trail termination point is the only place defects can be detected, it is vital that one consequent action of defect detection at this point is the generation of a Forward Defect Indication (FDI)......sometimes known as AIS. The whole purpose of this signal is to tell all the higher layer client networks (at *their* trail termination points) not to raise alarms....else this will cause major problems/opex costs chasing faults in the wrong layer/place (this can be across different operators in different countries, so it's a pretty serious issue). This itself is a 'flooding' behaviour, but it is flooding constrained only to the affected clients.....and not the rest of the server (or client) layer network(s), which don't really need to know about this.
In the case of a fibre cut, FDI would go in both directions. By definition this must be the fastest form of signalling to inform the nodes at either end of the affected trail(s). In the case of a uni-directional failure, FDI would go forwards and one would have to use either the BDI or some dedicated backwards signalling to inform the head end.
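A minimal sketch of these consequent actions in Python, reusing the illustrative Trail shape (nothing here is normative, and send_bdi stands in for whatever backwards signalling is used):

class Trail:
    def __init__(self, name, clients=(), head_end=None):
        self.name = name
        self.clients = list(clients)   # client-layer trails riding this one
        self.head_end = head_end
        self.alarms_suppressed = False

def receive_fdi(trail):
    # Consequent action of FDI/AIS: raise no alarm in this layer and pass
    # the indication only to this trail's own clients (constrained flooding).
    trail.alarms_suppressed = True
    for client in trail.clients:
        receive_fdi(client)

def on_defect_detected(trail, unidirectional, send_bdi):
    # The layer that detects the defect owns the alarm; clients get FDI.
    print("alarm: defect on", trail.name)
    for client in trail.clients:
        receive_fdi(client)
    if unidirectional:
        send_bdi(trail.head_end)  # BDI or dedicated backwards signalling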
So given we *must* have FDI for client-layer alarm suppression, I am not at all clear what benefits your proposal gives that targeted fault notification would not achieve just as well. BTW - we have done extensive testing of restoration schemes using signalling with crankback and simple routing (plus route pruning) processes.....and it works great.
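For what it's worth, that crankback loop can be sketched in the same illustrative style (compute_route and try_hop stand in for the real routing and signalling machinery):

def restore_with_crankback(src, dst, compute_route, try_hop):
    pruned = set()
    while True:
        route = compute_route(src, dst, exclude=pruned)
        if route is None:
            return None               # no path left after pruning
        for link in route:
            if not try_hop(link):     # setup blocked at this hop
                pruned.add(link)      # prune the link and crank back
                break
        else:
            return route              # every hop succeeded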
regards, Neil
-----Original Message-----