Re: I-D ACTION:draft-soumiya-lmp-fault-notification-ext-01.txt



Hi Vishal,

Please see comments in previous mail to Richard et alia.

> > 2. Fault repair
> >
> >   I don't see anything in your draft that discusses fault repair. How
> >   does the reporting node revoke its fault report?
>
> The reporting node will invoke "fault repair" using normal channels
> (possibly just the routing protocol in use in the network). This is
> because the "fault repaired" info. isn't really time critical. Its
> only purpose is to make sure (as you observe below) that repaired
> resources don't remain stranded indefinitely.
>
> I think the existing P&R drafts don't make explicit mention of
> fault repair either, probably for this very reason. (They also assume,
> like we do, that the "fault repair" indication is a background activity
> that happens via normal channels.)

I agree, but my point is that when you split the distribution of information
between two protocols you have no way of knowing for certain whether a
resource is available or faulted. Which notification will you believe?

I think that the need to resolve this conflict means that all of the points
I raised below are relevant.

> >   This would appear to be a requirement otherwise repaired
> >   resources will never be made available again. I think that when
> >   you add this function and add the necessary controls to ensure
> >   that each node has the right state (fault or no fault) you will have
> >   invented a link state protocol.
>
> Based on the explanation above, we don't really see the need to add
> this functionality to a fault notification protocol of the type proposed in
> draft-rabbat-fault-notification-protocol-03.txt.
> So fault notification retains its lightweight character, as desired.
>
> >   At this point I don't see what LMP gives you that an existing
> >   link-state protocol doesn't already deliver. Certainly the
> >   speed of reporting of faults to every node in the network
> >   will be lost once you have to prevent "thrash" of fault and
> >   fault clear notifications.
>
> Given that the "no fault" indication isn't time critical, as discussed
> above, one still derives the benefit of lightweight fault notification
> using LMP.
>
> Some of the drawbacks of adapting a link-state routing protocol to
> achieve such notification were discussed in the thread a few ago,
> in response to Roberto Albenese's email (which also pointed out
> the difficulty of scaling a signaling-based notification scheme
> when there are multiple LSP/lambdas on a link, and they don't
> share the same ingress/egress pairs, thus reducing the effectiveness
> of bundling Notifys.)

Agreed, but...
Please distinguish between notifying repair points about a failed LSP and
notifying repair points about the availability of resources in the network.
Also note that Notify Request objects may be modified within the network to
change how Notify messages are propagated, so that fault notification
approximates your LMP flooding close to the fault and one-for-one
notification further from the fault.

> Finally, the "thrashing" you refer to would be an issue even in
> signaling-based notification. If you send a Notify message to a
> repair point and then learn that the fault has been repaired,
> what does a detecting entity do? I think it pretty much allows
> the recovery action to proceed.

Notify messages do not inform the CSPF database: That is updated by the
routing protocol. The Notify message may supply input to the CSPF
calculation (see the crankback draft).

> The assumption here is that there is a fault correlation phase
> preceding the notification phase. If the fault correlation function
> determines that a fault has occurred that requires the initiation of
> a recovery action, I would think the action wouldn't really be
> turned off midway.
> Rather, the recovery action would proceed, and,
> upon learning of the fault repair (via normal channels), policy
> would have to decide how to use the repaired resources.

Again, the distinction is between LSP fault notification and resource fault
notification. No one (I hope) is suggesting that an RSVP-TE message should
feed information into the traffic engineering database.

> >  In any case, it is not clear to me that
> >   every node in the network needs rapid notification of faults -
> >   only those nodes that constitute repair points for the LSPs
> >   that used the failed resource need to know quickly and they
> >   hear about the problem through a directed Notify message
> >   that must propagate faster than any hop-by-hop protocol
> >   can.
>
> Actually, that's an interesting point, probably worth discussing.
>
> The flooding-based notification ensures that the fault notification
> gets to the repair point in the minimum number of hops.

Not necessarily so. The LMP method (with out-of-band signaling)
may require many IP hops between each LMP-capable switch.

> (The signaling-based
> notification allows the message to follow the "shortest path" to the
> repair point, where the shortest path is based on metrics for plain IP
> routing, which don't necessarily ensure that it gets there in the least
> number of hops; it only ensures that it gets there following the "shortest
> path" per the applicable metrics.)

Recall that not all nodes on the control plane are (G)MPLS switches.
So this isn't as simple as it may seem. In the IP network it may be just one
hop from ingress to egress even though there are many LSRs on the LSP. This
is made more significant when LSPs are routed using TE considerations.

> The second observation is that in shared mesh recovery, not only repair
> points but all intermediate nodes along the recovery path need to learn
> of the fault (so that they may reconfigure their cross-connects to carry
> the working traffic coming down the recovery path).

Yes, but...
When notified of a fault, the repair path does not know that the working
path has failed and that the protection path is about to be activated unless
- it knows about the full route of the working path
- it knows about all other faults in the network
- it knows about the policy applied at repair points
(consider, at least, n protects m situations)

> Using signaling-based notification, this requires an additional 2 or 3 phase
> handshake between the LSP end-points, which lengthens recovery time.

Yes. This is a considerable drawback of "extra traffic" usage. That is why
this is a lower grade protection service compared to straight 1:1 or 1+1
protection.

> Additionally, there is the issue of scaling signaling when a large number
> of LSPs are affected by a given fault (e.g. a fiber cut).
>
> However, using controlled flooding-based notification, as proposed
> in draft-rabbat, it is shown (in the draft) that by an adequate choice
> of recovery paths, it's possible to ensure that all nodes along the
> recovery path learn of the fault (on the working path) by the time the
> end-nodes learn about it. This has two benefits: multiple nodes along
> the recovery path that learn of the fault may reconfigure themselves in
> parallel (saving time), and no additional handshaking is needed once the
> repair point learns of the fault.

Sure they will all learn of the fault, but how will they learn which LSPs
are impacted?  The implication is that every node has a full label-enabled
RRO available to it and that each time it receives a fault report from
anywhere in the network it processes all RROs in all LSPs that transit it.
This is non-trivial processing.
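
To illustrate the processing cost, here is a minimal Python sketch of what a
transit node would have to do on each flooded fault report, assuming it holds
a label-enabled recorded route for every LSP that transits it. The names
(impacted_lsps, the rros mapping, the link and LSP identifiers) are
hypothetical and are not taken from any draft.

```python
from typing import Dict, List, Tuple

# One recorded hop: (link identifier, label). Both are hypothetical strings.
Hop = Tuple[str, str]


def impacted_lsps(rros: Dict[str, List[Hop]], failed_link: str) -> List[str]:
    """Return the LSPs whose recorded route crosses the failed link.

    Every recorded route held by the node is scanned for every fault
    report, so the work grows with (number of LSPs) x (hops per LSP).
    """
    return sorted(
        lsp_id
        for lsp_id, hops in rros.items()
        if any(link == failed_link for link, _label in hops)
    )


# Hypothetical state held by one transit node.
rros = {
    "lsp-a": [("A-B", "17"), ("B-C", "22")],
    "lsp-b": [("A-D", "5")],
    "lsp-c": [("B-C", "9"), ("C-E", "31")],
}
print(impacted_lsps(rros, "B-C"))
```

The scan runs for a fault anywhere in the network, including faults that
touch none of the LSPs at this node.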

> And, of course there is the potential advantage that flooding allows for
> (in the worst case) no more than one message per link per fault, whereas
> signaling will require (in the worst case) a number of messages equal to
> the no. of LSPs to be sent by the detecting entity to the LSP repair
> points.

But, as in my email to Richard, this could be looked at the other way.

In the worst case, flooding requires a message and processing on every link
and node in the network, whereas in signaling (in the worst case) only two
messages are sent for each impacted LSP.
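
The two worst cases can be put side by side in a rough Python sketch. It
assumes one flooded message per link and two Notify messages per impacted
LSP, as stated above; the function names and the example figures are
hypothetical, and the per-node processing cost of flooding is ignored.

```python
def flooding_messages(num_links: int) -> int:
    """Worst case for flooding: one message on every link in the network."""
    return num_links


def signaling_messages(num_impacted_lsps: int, per_lsp: int = 2) -> int:
    """Worst case for signaling: per_lsp messages for each impacted LSP."""
    return num_impacted_lsps * per_lsp


def cheaper_scheme(num_links: int, num_impacted_lsps: int) -> str:
    """Name the scheme that sends fewer messages in the worst case."""
    flood = flooding_messages(num_links)
    signal = signaling_messages(num_impacted_lsps)
    if flood < signal:
        return "flooding"
    if signal < flood:
        return "signaling"
    return "tie"


# A large network with few impacted LSPs, then a small one with many.
print(cheaper_scheme(num_links=200, num_impacted_lsps=10))
print(cheaper_scheme(num_links=50, num_impacted_lsps=100))
```

Which scheme sends fewer messages depends on the ratio of network size to
the number of LSPs hit by the fault, which is why either worst case can be
made to look bad.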

Cheers,
Adrian