Control plane resiliency

To: <ccamp@ops.ietf.org>
Subject: Control plane resiliency
From: "Adrian Farrel" <adrian@olddog.co.uk>
Date: Mon, 31 Oct 2005 22:37:34 -0000
Reply-to: "Adrian Farrel" <adrian@olddog.co.uk>

Hi,

It is nice to see some traffic on the CCAMP list.

Some observations based on your discussions (in no particular order).

1. There is a distinction to be made between control channel resiliency
and controller resiliency.

Since the control channel uses IP, it will heal so long as there is IP
connectivity (failing that you're dead anyway); the issue is the speed of
convergence. Other solutions exist to provide control channel redundancy,
but I believe that these are either implementation specific or are
hardware specific. In either case, they are out of scope and (as Zafar
says) the IP connectivity would remain.

Controller resiliency is actually a debate about how to manage the data
plane in the absence of a controller. Note that it is perfectly possible
to lose your signaling controller without losing your routing controller,
etc.

2. There is no need to make a distinction between data plane technologies.
That is, for GMPLS this applies equally to PSC and non-PSC.

3. Dimitri's observation that it may be worthwhile documenting the
concerns and explaining why they are not issues (i.e. how they solve
themselves) may be very valuable. It should suit all people because if the
concerns are real then the attempt at documentation will expose them,
while if the concerns are not real we will have an advisory document for
all time.

4. *The* controller resiliency issue appears to be: how to manage and
tear down an LSP downstream of a broken controller. I do not understand why
the answer is not one of:
a. Wait for the controller to be repaired
b. Use the management plane
It seems that in-place modification of GMPLS LSPs is not a common
operational feature for an active LSP. This leaves us with teardown. For
teardown, option a does not seem to be very painful, but if it does cause
problems (and alarms *will* be raised) option b is available.

5. The ability to manage the data plane resources at the node where the
control plane has failed may still exist in the management plane (in
which case use it!) or may also be gone (in which case, who cares that you
can't free up the resources for another user, since another user cannot
make use of the resources). You *might* manage to claim that there is a
safety issue here - anyone know what OSMINE says about loss of management
plane connectivity to a device with lasers?

6. There may be a *separate* issue (as raised by Tom) about how to synch
the control plane with the data plane. This might be particularly
important after controller recovery. Such OAM features should be part of a
separate effort (such as a GMPLS OAM requirements draft).

7. Lyndon points out that there is scope for a "standby signaling
controller". This could mean two things.
a. An implementation option (as John points out) including, but not
limited to:
   i. multiple instances of the software component
   ii. multiple instances of a controller CPU
   iii. control-plane message replication to distinct devices
b. A protocol option where two entirely different "routers" manage the
same data plane switch. This is pretty much outside our architecture
(although we could handle a "replacement signaling controller" if it
turned up with the same addresses). *If* someone wants to work on this
idea, I suggest they need to develop the requirements and convince the
community that what we have already doesn't provide everything we need
anyway. (Not saying that the standby model is wrong, but am suggesting
that we might not have a need for it at the moment.)

8. In a transport network, we must accept an operator's right to entirely
eliminate soft state timeouts. That is, the failure of the control plane
for any length of time must not impact the traffic. Note that even though
RSVP is a soft state protocol, the refresh period may be set so high (32
bits' worth of milliseconds) as to be practically infinite. Note
specifically that the failure of Hellos is used in RFC3473 to temporarily
"turn off" the soft state nature of RSVP-TE.
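
As a rough check on "practically infinite" in point 8, the arithmetic can be
sketched as below. This is only an illustration: the function names are
made up for the example, and the lifetime rule L >= (K + 0.5) * 1.5 * R with
K = 3 is taken from RFC 2205, not from this discussion.

```python
from datetime import timedelta

# Largest value that fits in an unsigned 32-bit refresh period field (ms).
MAX_REFRESH_MS = 2**32 - 1


def refresh_period(ms: int) -> timedelta:
    """Convert a refresh period R in milliseconds to a timedelta."""
    return timedelta(milliseconds=ms)


def state_lifetime(ms: int, k: int = 3) -> timedelta:
    """Minimum state lifetime L = (K + 0.5) * 1.5 * R (RFC 2205 rule).

    K is the number of consecutive lost refreshes to tolerate
    (assumed here to be the suggested value of 3).
    """
    return timedelta(milliseconds=(k + 0.5) * 1.5 * ms)


if __name__ == "__main__":
    print("max refresh period:", refresh_period(MAX_REFRESH_MS))
    print("state lifetime at max:", state_lifetime(MAX_REFRESH_MS))
    print("state lifetime at 30 s:", state_lifetime(30_000))
```

With the field at its maximum, the refresh period is a little under 50 days
and state would survive roughly 260 days without a refresh, which is well
beyond any plausible control plane outage.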


In summary:

A. We have decided that observation 3 might be useful. Is anyone planning
to work on this?
B. We have not heard from Young since the start of this debate. I hope the
hot words have not driven him away.
C. Igor appears to be out on a limb here. While I appreciate him pushing a
point that he believes in strongly, such determination in the face of WG
consensus approaches a DOS attack (unintentional, I'm sure) on the IETF
process. Please be careful.

Thanks,
Adrian

Follow-Ups:
- Re: Control plane resiliency
  - From: "Igor Bryskin" <ibryskin@movaz.com>
- Re: Control plane resiliency
  - From: George Swallow <swallow@cisco.com>
