
Transport networks and draft-ietf-mpls-generalized-rsvp-te-05



One characteristic of traditional transport networks is that
control plane failures do not affect established lightpaths or
TDM channels.  This is reflected by requirement 144 in
draft-ietf-ipo-carrier-requirements-00: "Control plane failures
shall not cause failure of established data plane connections."
Note that there is no time limit imposed on how long a control
plane failure can last without affecting established data plane
connections.

The Restart Time field in the Restart_Cap object supports a
maximum value of about 49 days.  In a transport network, this
means that if a controller fails (due to, say, a natural
disaster) and the node is not accessible, established
connections will fail after 49 days even if the data plane is
not affected, which is unacceptable.
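
A quick check of the 49-day figure, assuming Restart Time is a
32-bit unsigned field counted in milliseconds (the width and unit
are assumptions of this sketch, and the constant names are
illustrative only):

```python
# Assumption: Restart Time is a 32-bit unsigned value in milliseconds.
MAX_RESTART_TIME_MS = 0xFFFFFFFF
MS_PER_DAY = 24 * 60 * 60 * 1000

# Largest restart time that could be advertised, expressed in days.
max_restart_days = MAX_RESTART_TIME_MS / MS_PER_DAY
print(f"maximum advertisable restart time: {max_restart_days:.1f} days")
```

Under that assumption the ceiling is a little under 50 days, so any
control plane outage longer than that cannot be expressed at all.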

Even if this is fixed so that a node can advertise that the
restart time is infinite, there is another issue.  In existing
transport networks, established data plane connections do not
fail even if multiple controllers fail.  Accordingly, the
requirement from draft-ietf-ipo-carrier-requirements-00 quoted
above requires that there be no failure in the data plane in case
of "control plane failures."  This would effectively mean that
the recovery time could also be indeterminate. Consider the
following example with four nodes:

                      A ---- B ---- C ---- D

If B and C fail and C then restarts, C cannot complete
synchronization with D until B restarts and exchanges state with
C.
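
The dependency in this example can be sketched as a toy model.  It
assumes an LSP signalled from A towards D, and that a restarted
node can finish resynchronizing with its downstream neighbor only
once its upstream neighbor is up and has exchanged state; the
function and variable names are hypothetical and not part of any
draft:

```python
# Toy model of the chain A - B - C - D for an LSP signalled from A to D.
# Assumption: a restarted node can complete synchronization only after
# its upstream neighbor (if it has one) is up again.
CHAIN = ["A", "B", "C", "D"]

def can_complete_sync(node, up):
    """Return True if `node` is up and its upstream neighbor, if any, is up."""
    if node not in up:
        return False
    i = CHAIN.index(node)
    return i == 0 or CHAIN[i - 1] in up

up = {"A", "D"}                          # B and C have failed
up.add("C")                              # C restarts first
c_before_b = can_complete_sync("C", up)  # B is still down
up.add("B")                              # B restarts at some later time
c_after_b = can_complete_sync("C", up)   # B is now available to C
print(c_before_b, c_after_b)
```

In this model C stays unsynchronized for as long as B is down, which
is why the recovery time has no fixed upper bound.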

I believe this level of resiliency is not possible with the
procedure specified.  Just advertising 0xffffffff for recovery
time will not be enough, as this could cause state to be held
indefinitely when it should be deleted.

So I suggest one of the following:

  - remove the restart section from the draft and consider it
    as a separate draft, so that the rest can move forward.

  - add a constraint in the beginning of section 9.5, after the
    first sentence:

    "The recovery mechanism specified in this section addresses
    restart issues only in case of PSC devices, and does not
    fulfil the requirements for transport networks.  Recovery
    procedures for transport networks are for further study."

  - postpone the last call, and look at adding a solution that
    would also satisfy the requirements of transport networks.

From a transport network perspective, this could be a starting
point for discussion:

  - A new object called PERSISTENT_SESSION is introduced to
    indicate that a session is long lasting.  This will be sent
    in the Path message associated with transport network
    sessions.  This will be reflected back in the Resv message.
    The fact that a session is persistent is preserved along with
    forwarding plane information.

  - A new message (LRefresh) is introduced to synchronize the
    forwarding plane information for persistent sessions between
    peers, and it is transmitted periodically.

  - A new object called Flags is added, to be carried in Hello
    messages.  This object has a single field, also called Flags,
    and a node supporting persistent sessions will set bit 0 of
    that field.  Support for persistent sessions implies
    support for the PERSISTENT_SESSION object as well as
    LRefresh.

  - For sessions that are marked persistent, state will be
    cleared only under these conditions:

    - State will never be cleared based on timeout, except as
      listed below.

    - State will be cleared based on timeout only after sending
      the Resv with the D bit set in the Admin_Status object
      upstream, and only if graceful deletion was initiated by
      the headend. This information will be preserved along with
      forwarding plane information.

    - Reception of PathTear.

    - If a node learns that it has some state in the forwarding
      plane that is not in sync with that of the peer, the
      forwarding plane state as well as associated control plane
      state will be cleared on a timer basis.  This timer will be
      more coarse-grained than the refresh timers for the control
      plane.

    - If a peer does not support persistent sessions, state that
      has to be refreshed by that peer will be timed out in the
      usual manner.  The restart procedure specified for PSC
      devices may be used if both nodes support it.

  - Specify clearly what help is needed from outside RSVP. To
    support recovery of hierarchical LSPs, RSVP needs the
    following information upon restart:

    - The mapping between the interface ID assigned by the peer
      and the interface ID assigned locally.  (Related question:
      there was a suggestion on the list (from Yakov) that a node
      can learn this from the recovered link state database.  But
      if the restart time is going to be indeterminate, isn't it
      possible that the TE-LSAs will have expired by the time the
      node restarts?  If so, a node might instead preserve this
      information across restart.  In any case, RSVP needs
      this information upon restart.)

    - Peers (LSR ids) with which Hello adjacencies were
      established dynamically as a result of FA-LSPs.
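
To make the state-clearing conditions for persistent sessions
above concrete, here is a minimal sketch of the decision as a
single function.  The type, field, and event names are
hypothetical, chosen only for illustration; the logic follows the
conditions as listed and is not a definitive procedure:

```python
from dataclasses import dataclass

@dataclass
class SessionState:
    persistent: bool                # session carried PERSISTENT_SESSION
    peer_supports_persistent: bool  # peer signalled support in Hello
    graceful_delete_sent: bool      # Resv with D bit sent upstream after
                                    # headend-initiated graceful deletion

def may_clear_state(s, event):
    """Decide whether `event` allows session state to be cleared.

    event: 'refresh_timeout', 'path_tear' or 'out_of_sync_timeout'
    (the coarse-grained timer for forwarding state out of sync with a peer).
    """
    if event == "path_tear":
        return True                  # PathTear always clears state
    if not s.persistent or not s.peer_supports_persistent:
        # Usual soft-state behaviour: refresh timeouts clear state.
        return event == "refresh_timeout"
    if event == "refresh_timeout":
        # Persistent session: timeout only after graceful deletion.
        return s.graceful_delete_sent
    return event == "out_of_sync_timeout"

s = SessionState(persistent=True, peer_supports_persistent=True,
                 graceful_delete_sent=False)
print(may_clear_state(s, "refresh_timeout"))  # False
print(may_clear_state(s, "path_tear"))        # True
```

The point of the sketch is that, for a persistent session, loss of
refreshes by itself never removes state.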


Comments?


Cheers,

Gopal