RE: Last Call: RADIUS Support For Extensible Authentication Protocol(EAP) to Informational



Overall, I agree with Glen's comments and would therefore recommend that
the paragraph in question be removed from RFC 2869bis, and that a separate
document be developed to specify RADIUS retransmission behavior in more detail.

> the TCP RTO algorithms are designed to prevent network congestion, which
> (though it may occur) is not the problem we're trying to solve here,
> which I think might be better modeled as a server resource contention
> problem.

The basic principle embodied in the TCP RTO algorithms is "conservation of
packets" -- that a new packet should not enter the network until there is
reason to believe that another packet has left it. Both a server resource
contention problem and congestion manifest themselves in terms of increased
delays and packet loss, and so it seems to me that the principle applies
in both cases -- and can be addressed by dynamic timeout estimation, backoff
and jittering.
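
To make "dynamic timeout estimation and backoff" concrete, here is a minimal
sketch of a TCP-style RTO estimator (smoothed RTT plus four times the RTT
variance, doubled on each timeout). The class name, method names, and the
default initial/minimum/maximum values are my own assumptions for
illustration; nothing here is specified for RADIUS.

```python
class RtoEstimator:
    """Illustrative TCP-style retransmission timeout estimator (hypothetical names)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, initial_rto=3.0, min_rto=1.0, max_rto=60.0):
        # Defaults are assumptions chosen for the sketch, not RADIUS requirements.
        self.srtt = None
        self.rttvar = None
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.rto = initial_rto

    def _clamp(self, value):
        return min(self.max_rto, max(self.min_rto, value))

    def on_sample(self, rtt):
        """Feed one RTT measurement, in seconds.

        Only use samples from requests that were not retransmitted,
        since a reply to a retransmitted request is ambiguous.
        """
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self._clamp(self.srtt + self.K * self.rttvar)

    def on_timeout(self):
        """Back off exponentially when a request times out."""
        self.rto = self._clamp(self.rto * 2)
```

A server that slows down under load produces larger RTT samples and more
timeouts, so the same estimator backs clients off whether the cause is
network congestion or server resource contention.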

> Furthermore, the problem is not specific to EAP-over-RADIUS;
> in the situation mentioned above, the same behavior would be observed
> regardless of the authentication protocol in use.  This fact suggests
> that a solution to the problem (presupposing that a protocol-based
> solution is reasonable) should be published as an update to RFC 2865,
> rather than RFC 2869.

I agree with Glen that the problem can occur with any RADIUS
usage (authentication or accounting), and therefore is not specific to
RADIUS/EAP.  It would be best to handle this in a separate
document.

> There are other reasons why the RFC 2988 algorithms are inappropriate
> for use with RADIUS. Some are mentioned in RFC 2865 (section 2.4);

Section 2.4 mentions that TCP is too aggressive in terms of
retransmission.  However, in practice we are seeing RADIUS client
implementations that are much *more* aggressive than TCP.  For
example, one RADIUS client from a well-known vendor has a default RTO of 1
second with no backoff.  Several thousand of these clients recently
caused a service outage on our network after a network-wide password
reset resulted in a RADIUS server overload.  Since the clients did not back
off (or jitter), the result was extremely high load: clients kept retrying
EAP authentication without success, hammering the servers until we had to
change the network-wide configuration to bring the network online again.
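
As a rough illustration of the difference backoff makes, the sketch below
counts how many times a single client sends one request to an unresponsive
server within a given window. The function name and parameters are
hypothetical, and the 60-second window and 60-second cap are assumptions
chosen only for the arithmetic.

```python
def count_transmissions(window, initial_rto, backoff=1.0, max_rto=None):
    """Count sends of one request (original plus retransmissions) made
    before `window` seconds elapse, assuming the server never answers."""
    elapsed = 0.0
    rto = initial_rto
    count = 0
    while elapsed < window:
        count += 1            # send (or resend) at time `elapsed`
        elapsed += rto        # wait one timeout before the next attempt
        rto *= backoff        # backoff=1.0 means a fixed timer
        if max_rto is not None:
            rto = min(rto, max_rto)
    return count


fixed = count_transmissions(window=60, initial_rto=1.0)
doubling = count_transmissions(window=60, initial_rto=1.0, backoff=2.0, max_rto=60.0)
print(fixed, doubling)
```

With a fixed 1-second timer each client sends 60 times per minute of
outage; with doubling it sends 6 times (at 0, 1, 3, 7, 15 and 31 seconds).
Multiply by the number of clients to see the effect on the server.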

The other thing mentioned in section 2.4 is that faster failover is
desired than what TCP would provide.  It seems to me that RTO
estimation and failover are orthogonal issues, so doing dynamic RTO
estimation and backoff need not adversely affect failover algorithms.

> others include an initial TMO that is likely to be too short in
> situations where one or more RADIUS proxies are traversed, the large
> granularity of the timers specified and the deterministic nature of the
> algorithms used which in the worst case could result in all the clients
> firing repeated salvos of requests in lockstep (not a good way to reduce
> instantaneous server loading!).

I agree that jitter is required here, and that the initial RTO might be
set to a higher value (say, 5 seconds).
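
A minimal sketch of what a jittered schedule might look like, assuming an
initial RTO of 5 seconds, doubling per attempt, a 60-second cap, and a
uniform jitter of plus or minus 10%. The function name, the cap, and the
jitter fraction are my own assumptions for illustration, not proposed
values.

```python
import random


def retransmit_delays(initial_rto=5.0, max_rto=60.0, attempts=5, jitter=0.1, rng=None):
    """Return the wait before each retransmission, in seconds.

    Each delay is the backed-off RTO scaled by a random factor in
    [1 - jitter, 1 + jitter], so clients that time out together do not
    retransmit in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    rto = initial_rto
    for _ in range(attempts):
        delays.append(rto * (1 + rng.uniform(-jitter, jitter)))
        rto = min(rto * 2, max_rto)   # exponential backoff with a ceiling
    return delays


print(retransmit_delays())
```

Because the jitter is applied independently on each client, a fleet that
starts retrying at the same instant spreads out a little more with every
attempt.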