AAA Dead Timers

  • 1.  AAA Dead Timers

    Posted Sep 22, 2025 06:58 AM

    I've logged a TAC call about this but I'm getting nowhere and this is the second time I've logged it. 

    We have two clusters (A/B and C/D) and four ClearPass servers (CPPM-A1, A2, B1 and B2). Due to work at our data centre last week, we disabled clusters A and C and failed over to B and D. CPPM-A1 and A2 were also switched off and the router ports disabled.

    We identified that the controllers were still trying to send authentication requests to the ClearPass servers which are off. The AAA dead timer is set to 10 minutes.

    (MM-8B) [mynode] #show aaa timers

     Global User idle timeout = 300 seconds

    Auth Server dead time = 10 minutes

    Logon user lifetime = 5 minutes

    User Interim stats frequency = 600 seconds

    My understanding is that once the 10 minutes have elapsed, authentication requests should no longer be sent to the "dead" ClearPass servers. However, that doesn't seem to be happening.

    Sep 18 08:49:03 2025  dot1x-proc:2[9451]: <124004> <9451> <DBUG> |dot1x-proc:2|  aal_auth_raw (1322)(INC) : os_auths 1, s cppm-a2 type 2 inservice 1 markedD 0 sg_name cppm_srvgrp
    Sep 18 08:49:03 2025  dot1x-proc:2[9451]: <124004> <9451> <DBUG> |dot1x-proc:2|  aal_auth_raw (1325)(INC) : os_reqs 8, s cppm-a2 type 2 inservice 1 markedD 0 

    Sep 18 08:49:03 2025  dot1x-proc:1[9448]: <124004> <9448> <DBUG> |dot1x-proc:1|  aal_auth_raw (1322)(INC) : os_auths 2, s cppm-a2 type 2 inservice 1 markedD 0 sg_name cppm_srvgrp
    Sep 18 08:49:03 2025  dot1x-proc:1[9448]: <124004> <9448> <DBUG> |dot1x-proc:1|  aal_auth_raw (1325)(INC) : os_reqs 9, s cppm-a2 type 2 inservice 1 markedD 0 

    Sep 18 08:49:04 2025  dot1x-proc:2[9451]: <121004> <9451> <WARN> |dot1x-proc:2| |aaa| RADIUS server cppm-a2 server-group cppm-6.11_srvgrp --IPaddress-1812 timeout for client=MACaddress auth method 802.1x
    Sep 18 08:49:04 2025  dot1x-proc:2[9451]: <138236> <9451> <DBUG> |dot1x-proc:2|  wpa3 0 wpa2 1 wpa 0 apape 2ptk 0 gtk 0 onlyptk 0 newauth 0 psk 0 wired 2

    Sep 18 08:49:05 2025  authmgr[8790]: <121004> <8790> <WARN> |authmgr| |aaa| RADIUS server cppm-a2 server-group cppm_srvgrp --IPAddress-1812 timeout for client=MACaddress auth method MAC


    We're running 8.10.0.19. Has anyone else seen this?



    -------------------------------------------


  • 2.  RE: AAA Dead Timers

    Posted Sep 22, 2025 07:38 AM

    If the CPPM nodes are intentionally offline, don't rely on the dead-time alone. The controller will still cycle through all servers in the group. Easiest fix is to just pull those nodes out of the server-group (or disable them) until they're back online:

    configure terminal
    aaa server-group cppm_srvgrp
    no auth-server cppm-a1
    no auth-server cppm-a2
    exit
    end
    write memory

    This way the controller only talks to the active ClearPass nodes and you won't see the repeated 1812 timeout messages. If you have a cluster VIP, even better: point the controller to that so failover is handled inside ClearPass.

    Cheers,

    Vigan

    -------------------------------------------



  • 3.  RE: AAA Dead Timers

    Posted Sep 22, 2025 10:22 AM

    Hi Vigan,

    Thank you for your response.

    Yes, once I identified the issue, I removed both servers from the server groups on both clusters.

    The purpose of the dead timers is to avoid the need for manual intervention, especially if this were to happen outside of working hours.

    Although we use the ClearPass VIPs for the captive portal and to access the active publisher, we don't use them for authentication due to the high volume of RADIUS requests they handle daily. At peak times, we have up to 50,000 clients connecting.

    The configuration is designed to ensure resilience and reduce operational overhead during high-load or out-of-hours scenarios.

    -------------------------------------------



  • 4.  RE: AAA Dead Timers

    Posted Sep 22, 2025 03:04 PM

    The dead timer is how long the server will be marked as out of service once that has been determined.  Once the timer has expired the controller will return that server back to the in-service list and the server will be tried again.
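
    A minimal sketch of the behaviour described above, written as a toy Python model rather than anything taken from ArubaOS. The class and field names are hypothetical; the 600-second value mirrors the 10-minute dead time shown earlier in the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadTimerServer:
    """Toy model of one RADIUS server entry (illustrative names, not ArubaOS internals)."""
    name: str
    dead_time_s: float = 600.0          # 10 minutes, as in 'show aaa timers'
    dead_until: Optional[float] = None  # None means "in service"

    def mark_dead(self, now: float) -> None:
        # Called once the server has been judged unresponsive.
        self.dead_until = now + self.dead_time_s

    def in_service(self, now: float) -> bool:
        # When the dead time expires the server is simply returned to
        # service; nothing here checks whether it is actually reachable.
        if self.dead_until is not None and now >= self.dead_until:
            self.dead_until = None
        return self.dead_until is None


# The server stays powered off for the whole period in this example.
srv = DeadTimerServer("cppm-a2")
srv.mark_dead(now=0.0)
states = [(t, srv.in_service(t)) for t in (0.0, 300.0, 599.0, 600.0)]
print(states)
```

    In this model the server is out of service at 0, 300 and 599 seconds and back in service at 600 seconds, even though it never came back up. Real requests then have to time out again before it is marked dead once more.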



    ------------------------------
    Carson Hulcher, ACEX#110
    ------------------------------



  • 5.  RE: AAA Dead Timers

    Posted Sep 30, 2025 02:57 AM

    Sorry for the delay, rather swamped at the moment.

    Thank you! I was misunderstanding it. 

    I'm keeping my TAC case open though as it should ideally be tested with a spoof RADIUS request rather than a legitimate one! 

    Based on your response and @vigan's above, I have formulated another workaround in the meantime. 

    Thank you both!
    -------------------------------------------



  • 6.  RE: AAA Dead Timers

    Posted Sep 30, 2025 05:01 AM

    Did you try to implement RADIUS tracking? It should provide exactly what you are looking for: it tests RADIUS server availability and marks unresponsive servers. At least it solved the problem for me. I have no problems with dead RADIUS servers, as they stay marked as dead until they are reachable again by access tracker.

    Best, Gorazd



    ------------------------------
    Gorazd Kikelj
    MVP Guru 2025
    ------------------------------



  • 7.  RE: AAA Dead Timers

    Posted Sep 30, 2025 05:26 AM

    I didn't, because we're running AOS 8. As far as I can tell (and I may be wrong), wasn't the radius-server tracking command introduced in 10.07?

    -------------------------------------------



  • 8.  RE: AAA Dead Timers

    Posted Sep 30, 2025 05:33 AM

    I believe the corresponding feature in AOS 8 is aaa test-server

    https://arubanetworking.hpe.com/techdocs/CLI-Bank/Content/aos8/aaa-test-srv.htm

     Best, Gorazd



    ------------------------------
    Gorazd Kikelj
    MVP Guru 2025
    ------------------------------



  • 9.  RE: AAA Dead Timers

    Posted Sep 30, 2025 09:37 AM

    The 'aaa test-server' feature is a command for manually validating an AAA server configuration. It is not a configuration item, nor am I aware of a server tracking option.



    ------------------------------
    Carson Hulcher, ACEX#110
    ------------------------------



  • 10.  RE: AAA Dead Timers

    Posted Oct 14, 2025 06:56 AM

    TAC has confirmed that the feature I'm looking for does not exist in AOS 8. They recommend submitting a feature request if we'd like it to be considered for future development.

    "I am afraid to inform you that there is no keepalive mechanism available currently in 8.x architecture design.
    We track the Radius Servers availability via authentication response from server. If there are multiple timeouts , the controller will mark it as dead."

    They did suggest a workaround: enabling RadSec, which uses TCP instead of UDP, thereby requiring a reliable connection before authentication requests are sent. I've successfully tested this in our development environment.
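
    To illustrate why a TCP transport helps here: a connection attempt to a powered-off host fails at connect time, whereas a UDP request simply goes unanswered until it times out. The sketch below is a plain Python reachability check under stated assumptions, not how the controller implements RadSec. The function name is hypothetical, and 2083 is the TCP port registered for RadSec. It only tests the TCP handshake, not TLS or RADIUS.

```python
import socket


def tcp_reachable(host: str, port: int = 2083, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the handshake cannot be completed.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

    For example, `tcp_reachable("cppm-a2.example.net")` (a placeholder hostname) would return False while that server is switched off, without any client authentication having to fail first.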

    I'm hesitant to enable it in our production environment due to the high volume of authentications our ClearPass servers handle. A separate TAC ticket confirmed that our 4 N3001 subscribers should be able to handle up to 1.8 million non-guest RADIUS requests using RadSec, which is reassuring. 

    -------------------------------------------



  • 11.  RE: AAA Dead Timers

    Posted Oct 14, 2025 07:31 AM

    If you'd rather avoid RadSec, one workaround is to tune the RADIUS retry and timeout values on both the controller and ClearPass.


    Set slightly higher retry_interval and deadtime on the controller, and reduce the response timeout on ClearPass so it replies faster under load.
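
    As a rough way to reason about the trade-off in the retry and timeout values, the arithmetic below estimates how long a single request waits on an unresponsive server before the controller gives up on it. This is a sketch only: the parameter names and example values are assumptions for illustration, and it assumes one initial attempt plus a number of retries that each wait the full timeout, which may not match how the controller counts them.

```python
def worst_case_wait_s(timeout_s: float, retransmits: int) -> float:
    """Seconds spent on an unresponsive server before giving up.

    Assumes one initial attempt plus `retransmits` retries, each waiting
    the full per-attempt timeout (an assumption, not documented behaviour).
    """
    return timeout_s * (1 + retransmits)


# Example values only; substitute the settings actually configured.
wait = worst_case_wait_s(timeout_s=5.0, retransmits=3)
print(wait)
```

    Under these assumptions, raising either value lengthens the wait for every client whose request lands on a server that is down.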

    Also, make sure your ClearPass cluster or server group is configured with multiple RADIUS hosts for redundancy - that helps smooth out transient UDP drops without false dead-server triggers.

    The other option, as a last resort, is RadSec with certificates signed by an internal CA.

    Cheers,

    Vigan

    -------------------------------------------



  • 12.  RE: AAA Dead Timers

    Posted Oct 14, 2025 08:28 AM

    Hi Vigan,

    Thanks for your response. I'm a bit confused, as we're not seeing any false dead-server triggers, and we have four N3001 CPPM appliances to handle authentications and any proxying.

    The issue we're encountering is a lack of redundancy. When a host goes down, authentication requests are still being sent to that (now dead) server. The problem arises after the dead-timer period lapses: the server is automatically marked as alive again without its actual availability being verified. This leads to legitimate requests failing unnecessarily.
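
    As a sketch of the behaviour being asked for here, the toy Python model below keeps a server out of service until a probe succeeds, instead of returning it automatically when the timer lapses. TAC's reply earlier in the thread says AOS 8 has no such keepalive mechanism, so this is purely illustrative. The class, the `probe` callback and the timings are all hypothetical.

```python
from typing import Callable, Optional


class ProbedServer:
    """Hypothetical probe-gated server entry: stays dead until a probe succeeds."""

    def __init__(self, name: str, probe: Callable[[], bool],
                 dead_time_s: float = 600.0) -> None:
        self.name = name
        self.probe = probe
        self.dead_time_s = dead_time_s
        self.dead_until: Optional[float] = None  # None means "in service"

    def mark_dead(self, now: float) -> None:
        self.dead_until = now + self.dead_time_s

    def in_service(self, now: float) -> bool:
        if self.dead_until is None:
            return True
        if now < self.dead_until:
            return False
        # Dead time elapsed: return to service only if the probe succeeds,
        # otherwise restart the timer without risking a client request.
        if self.probe():
            self.dead_until = None
            return True
        self.dead_until = now + self.dead_time_s
        return False


server_up = {"value": False}
srv = ProbedServer("cppm-a2", probe=lambda: server_up["value"])
srv.mark_dead(now=0.0)
first = srv.in_service(600.0)    # probe fails, timer restarts
server_up["value"] = True
second = srv.in_service(900.0)   # still inside the restarted dead time
third = srv.in_service(1200.0)   # probe succeeds, back in service
print(first, second, third)
```

    In this model no legitimate request is ever sent to the server while it is still down; only the probe is.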

    I'm waiting for TAC regarding AirGroup as there are log entries that suggest that this is affected too.

    Best regards,

    Anthony




  • 13.  RE: AAA Dead Timers

    Posted Oct 14, 2025 10:49 AM

    Note, if you do go with RadSec the limitation to be concerned about isn't the total capacity of the ClearPass cluster but the individual server authentication per second rate.  The overhead of RadSec will reduce that value by some amount.

    Your best solution for this is going to be a load balancer appliance or other device that can provide application load balancing.



    ------------------------------
    Carson Hulcher, ACEX#110
    ------------------------------



  • 14.  RE: AAA Dead Timers

    Posted Oct 14, 2025 11:47 AM
    Edited by vigan Oct 14, 2025 11:47 AM

    Hey man,

    According to the ArubaOS 8.6.0.0 User Guide (pp. 204–206), the controller marks a RADIUS server as down after consecutive retries and keeps it in that state for the configured dead-time (default 10 minutes). When the timer expires, the server is automatically reactivated without a reachability check, which can cause renewed authentication failures if the server remains offline. 

    To mitigate this, increase the dead-time and enable load balancing within the AAA server group so traffic is distributed only among responsive ClearPass nodes.

    I don't know if you have load-balancing enabled at this stage but if not here is how: 

    aaa server-group <group>
    load-balance
    auth-server <cpass01>
    auth-server <cpass02>

    At this stage I do not see anything on the official documentation that gives a straightforward fix to your exact issue.

    Give this a try if you haven't already and see if it at least mitigates or resolves the issue.

    Cheers,

    Vigan

    -------------------------------------------



  • 15.  RE: AAA Dead Timers

    Posted Oct 21, 2025 10:16 AM

    Hi both,

    We have four N3001 ClearPass appliances that are load balanced. Also, I've been told that the "sensible" function of checking that a server is back in service doesn't exist. RadSec is a workaround (as it uses TCP rather than UDP). We could increase the dead-time, but that's not necessarily the best solution, as it would only postpone the issue rather than resolve it.

    We're currently holding off on enabling RadSec due to CPU issues on our routers and want to avoid adding any additional load to them at this stage.

    -------------------------------------------



  • 16.  RE: AAA Dead Timers

    Posted Oct 21, 2025 10:46 AM

    Without implementing an actual load balancer between ClearPass and the NADs, about the easiest option is to configure multiple VIPs (one per cluster node) and set the backup such that hopefully the IP address stays active through any planned outages.  That would also give you the opportunity to change the backup target from the ClearPass side rather than editing NAD configuration.



    ------------------------------
    Carson Hulcher, ACEX#110
    ------------------------------