Security

  • 1.  Recommended Cluster design(Publisher role)

    Posted May 21, 2019 09:27 PM

    Hi guys,

     

    We have been running a 3-node cluster for the last 3-4 years and have had growing issues moving from version to version, and also with activating OnGuard as a posture mechanism.

     

    We've lost access to our publisher multiple times because the load simply skyrockets and the httpd process is completely hammered by the WebAuths. Here is our configuration:

     

    C2000 x 3

     

    1x Publisher

    2x Subscribers

     

    Everything is load-balanced across the servers. InsightDB runs on only a single subscriber because, again, load is an issue.

     

    Since httpd is used for both the WebUI and OnGuard, is it recommended to have the Publisher node do the lighter lifting (TACACS, management) and leave RADIUS/WebAuth to the subscribers?

     

    I am unsure whether the newer versions are more taxing on the servers or whether it is simply bugs, but 2-3 times in the last few weeks our WebUI has been almost non-responsive: the server rejects TACACS because of load, and we are unable to do anything since we can't access the WebUI.

     

    Thinking of the wireless controller approach, we thought it might be a good idea to remove WebAuth from the Publisher completely so that it could never impact the HTTPD process, and to move TACACS only to that server for now. Typically the servers are stable and don't tend to crash, but since this version (6.7.9 now) we have had load issues that caused short outages. Today we had a 9-hour call with TAC to regain access to the publisher, since we lost access to it during the weekend, and after the 24-hour period the subscribers became Publishers... everything went bad.

     

    Everything is back now, but it feels as though this will happen again unless we modify the current setup. We may look into adding a 4th node to split the load even more and to have a failover publisher and TACACS redundancy, but that is down the road.

     

    All this to ask: is the typical design to have the publisher act like a master controller and leave the grunt work to the subscribers? I never came across this in the past, as OnGuard was never in play and so the HTTPD process was not taxed the way it is now.

     

    We have around 4000 devices, probably 1500-1700 concurrent.

     

    Any insight would be appreciated.



  • 2.  RE: Recommended Cluster design(Publisher role)

    Posted May 22, 2019 01:30 PM

    Hello BCote,

     

    Based on what you mentioned:

    " C2000 x 3

    1x Publisher

    2x Subscribers, 

    We have around 4000 devices, probably 1500-1700 concurrent."

     

    You should not be running into this issue. I think there is an underlying problem that needs to be root-caused, so I would ask the TAC engineer to find the root cause of httpd crashing on the publisher. Could it be old OnGuard agent clients sending a crazy number of HTTP GET requests because they were not upgraded or are running older versions? Either way, this needs to be fixed first. And yes, based on your servers' specs (C-2000), each should be able to handle 5k authentications, so you can keep the publisher for management/TACACS authentication and use the subscribers for everything else.

     

    However, I would recommend first finding the root cause of the high CPU; that should solve your issue.

     

     



  • 3.  RE: Recommended Cluster design(Publisher role)

    Posted May 22, 2019 02:09 PM

    Hi Fayyaz,

     

    Thank you for your input. That is what the TAC engineer seems to see as well. Some devices seem to "hammer" the servers continuously, and once enough devices flood the HTTPD process, it basically kills management and the server.

     

    Although my own device is running 6.7.9 (same as the server), I did appear in the logs as a system hammering the server. Not a lot, but I was still part of it.

     

    I am asking our client computing team to run a report on all our devices to see if they can still find old 6.7.5 versions that were not updated. It's possible they are the worst offenders.
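
    A report like this could be sketched roughly as follows. This is only an illustration: it assumes the inventory is already available as (hostname, agent version) pairs, which is a hypothetical format and not what SCCM actually exports. Versions are compared as tuples of integers so that something like 6.7.10 is not mistaken for being older than 6.7.9.

```python
def parse_version(text):
    """Turn a dotted version such as '6.7.5' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def outdated_agents(inventory, server_version):
    """Return the (hostname, version) pairs whose agent is older than the server.

    `inventory` is a hypothetical list of (hostname, version) pairs.
    """
    target = parse_version(server_version)
    return [(host, ver) for host, ver in inventory if parse_version(ver) < target]
```

    Calling `outdated_agents(inventory, "6.7.9")` would return only the hosts still running something older than 6.7.9.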

     

    To give you an example, TAC ran a regex-type command to list the devices hitting the servers most often, and this one came up:

     

    45651 172.20.134.33

     

    That's 45651 times that the device communicated with the server. I am working right now to grab the agent logs from that device and confirm the version, but as you mentioned, there is definitely some form of bug with this version, or a version difference causing an issue with the agents trying to communicate.
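
    For anyone wanting to produce a similar per-client count, here is a minimal sketch. It is not the command TAC ran. It assumes an access log in which the client address is the first whitespace-separated field on each line (as in the common Apache log format); the actual log location and format on the server are not covered here.

```python
from collections import Counter


def top_talkers(log_lines, n=10):
    """Count requests per client address and return the n busiest clients."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1  # assumption: client address is the first field
    return counts.most_common(n)
```

    The function accepts any iterable of lines, so an open file can be passed directly, for example `top_talkers(open("access_log"))`, where the file name is a placeholder. It returns (address, count) pairs sorted from busiest to quietest.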

     

    We push all our agents through SCCM and do not do any auto-update. 

     

    I'll keep you updated once I get more details from Engineering. I sent a lot of logs to them today after this discovery.



  • 4.  RE: Recommended Cluster design(Publisher role)

    Posted May 22, 2019 02:53 PM

    Yes, BCote.

     

    Using the logs, they should be able to narrow it down. They can also use the ClearPass logs to find which client is doing this. Let me know if it is taking a long time and I can increase the priority of the ticket for you. Hopefully they will find a solution soon and you should be good then.

     
