> Ok. There are situations where a connection may be up, but the
> application is unresponsive. It would be good to use the RFC 3539
> method to validate the connection.

The watchdog is based on application-layer timescales (e.g. 6 seconds),
rather than connection timescales. So it's possible for the watchdog to
cause application layer failover prior to connection failure/reset.

> I'm not sure having a separate connection for Status-Server is a good
> idea.

I'm not sure either. Separating the Status-Server traffic from the
RADSEC traffic breaks "fate sharing" which could introduce a number of
bugs.

> In addition, the algorithm in 3539 appears to be focussed on keeping
> the connections up... even if that means re-opening them. I'm not sure
> this is a good idea. It means that spikes in traffic cause a large
> number of connections to be opened... which then never close, or are
> continuously re-opened. Even if there's no traffic on them.

The idea is to always have a connection "ready" for traffic, so yes,
the algorithm does keep connections up even if there is no regular
traffic (e.g. the algorithm generates watchdog traffic).

> It may be worth adding suggestions:
>
> - TCP connections SHOULD be kept "full". i.e. used in a "most recently
>   used" fashion for normal RADIUS traffic.
>
> - The RFC 3539 watchdog algorithm should be used to determine the
>   status of a *connection*.

Not sure that the watchdog really determines connection status so much
as status at the application layer.

> - so long as one connection is alive, the server should be marked
>   "alive".

Agreed. But doesn't this somewhat conflict with the previous goal?

> - connections that haven't been used for T seconds (4 * RTT?) may be
>   pro-actively closed.

How do you know what RTT is? Or do you assume RTTMAX? Since routing
transients can take as long as 30 seconds to resolve, T probably would
need to be significant (e.g. minutes).

> - at least one connection should remain open to determine application
>   responsiveness.

Sure.
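
To make the combined effect of the suggestions above concrete, here is a rough sketch of how a client might apply them to a pool of connections. It is only an illustration under stated assumptions: the names Conn, Pool, pick, release and reap_idle are hypothetical, the per-connection "alive" flag stands in for whatever a watchdog reports (this is not the RFC 3539 state machine), and the limit of 256 requests in flight assumes the one-octet RADIUS Identifier is the only bound on a connection being "full".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Conn:
    """One connection to a server (hypothetical model, not a real socket)."""
    name: str
    last_used: float = 0.0     # time of the last normal RADIUS request
    alive: bool = True         # assumed to be set by a per-connection watchdog
    closed: bool = False
    in_flight: int = 0         # requests sent but not yet answered
    max_in_flight: int = 256   # assumption: bounded by the 8-bit Identifier


@dataclass
class Pool:
    idle_timeout: float        # "T" in seconds; assumed to be on the order of minutes
    conns: List[Conn] = field(default_factory=list)

    def open_conns(self) -> List[Conn]:
        return [c for c in self.conns if not c.closed]

    def pick(self, now: float) -> Optional[Conn]:
        """Choose the most recently used connection that still has room."""
        usable = [c for c in self.open_conns()
                  if c.alive and c.in_flight < c.max_in_flight]
        if not usable:
            return None
        conn = max(usable, key=lambda c: c.last_used)
        conn.last_used = now
        conn.in_flight += 1
        return conn

    def release(self, conn: Conn) -> None:
        """Record that a response arrived for one request on conn."""
        conn.in_flight = max(0, conn.in_flight - 1)

    def server_alive(self) -> bool:
        """The server counts as alive while any open connection is alive."""
        return any(c.alive for c in self.open_conns())

    def reap_idle(self, now: float) -> List[Conn]:
        """Close connections idle longer than T, always leaving one open."""
        closed = []
        # Visit the least recently used connections first.
        for c in sorted(self.open_conns(), key=lambda c: c.last_used):
            if len(self.open_conns()) <= 1:
                break
            if c.in_flight == 0 and now - c.last_used > self.idle_timeout:
                c.closed = True
                closed.append(c)
        return closed
```

Because traffic is steered to the most recently used connection, connections opened during a spike go quiet once the load drops and can be reaped after T, while the last remaining connection stays open for watchdog traffic. The sketch does not check whether the connection it keeps is itself alive; a real implementation would need to handle that case.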