ISE uptime - not 99,999 but 100 %?

tuenoerg
Cisco Employee

In short, the case is as follows: the customer rebooted the PAN during business hours, and unfortunately this happened at the same time the reauthentication timer expired. As a consequence, a number of broadcasting devices ended up in the critical VLAN. To be precise, these were devices handling live radio streaming.

The situation was made worse because the configuration was missing these commands:

authentication event server alive action reinitialize

authentication event server dead action authorize vlan ‘vlan id’

We are adding those commands, and we are working on introducing C3PL configuration and load balancers (Netscaler).
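For context, here is a minimal sketch of where those two commands would sit on an access port, assuming legacy (non-C3PL) IOS syntax. The interface name and VLAN 999 are placeholders that do not come from this thread, and the surrounding lines are a typical 802.1X/MAB port baseline rather than the customer's actual configuration.

```
! Hypothetical access port; interface and VLAN numbers are placeholders
interface GigabitEthernet1/0/10
 switchport mode access
 authentication port-control auto
 ! Periodic reauthentication, with the interval taken from the
 ! Session-Timeout attribute returned by the RADIUS server
 authentication periodic
 authentication timer reauthenticate server
 ! When all RADIUS servers are marked dead, authorize into the critical VLAN
 authentication event server dead action authorize vlan 999
 ! When a RADIUS server comes back, reinitialize the critical-auth sessions
 authentication event server alive action reinitialize
 dot1x pae authenticator
 mab
```

With the "alive" action in place, ports that were parked in the critical VLAN are reauthenticated once a server is reachable again instead of staying there.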

All in all, we hit a problem with services not working (slow responses), which ended with the RADIUS server being detected as down on the switches, and hence devices suddenly not responding in a timely fashion.

We know about the timer (5 minutes) for “heartbeat” of PAN from PSN testing.

Is that timer going to be shorter in the future?

We know it's documented in the admin guide that some services are down when the PAN is down.

So I hope you can help me clarify a few things:

If we do a switchover to the secondary administration node, does that cause the same "outages" while the other server is being promoted to PAN?

I have been over the BRK3699 session notes, but they do not say anything about the above.

What is the best practice for timers to minimize the impact of a PAN reboot or failure?

In ISE 2.x, automatic failover of the PAN is not enabled by default. Do we recommend enabling it, and if so, what are the impacts/consequences?

Basically, we need to make sure we can get 100 % uptime for 802.1X and MAB with profiling of devices, including devices in the studios and for live streaming. I would love to explain more and start a dialog with some of you on how to achieve 100 % uptime.

One last thing is the recommended setting for the load balancing parameters:

load-balance method least-outstanding xxx

What is the recommended number?

Best regards

Tue Noergaard

Accepted Solutions

Per internal discussions, there will be a window based on the Auto-FO timers where some services may be disrupted if Primary PAN is unavailable.  Standard AAA services continue but there are some services like Guest self-registration, device registration, or other services that are contingent on replication which could be impacted during the PAN cutover.

The specific issue of guest auth delay while the PPAN was down was partly due to NAD timers. It is possible to configure longer timeouts, or else revert to a critical auth mode to apply local auth policy when RADIUS is unavailable. The way to achieve higher availability for advanced ISE features today (I cannot speak about roadmap in this forum) would be to fail over to a separate deployment. This could be a backup ISE instance at the data center intended to take the full load, or only the sporadic cases, to accommodate the events impacted during the ~15 minute cutover. It is also possible to deploy a local standalone server to handle broader failure cases such as a complete WAN outage.

Craig
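As an illustration of the NAD timers mentioned above, here is a rough sketch of the switch-side RADIUS settings that control how long the switch waits for a reply and when it declares a server dead. The server name, IP address, probe username and every numeric value are assumptions chosen for illustration, not recommendations from this thread; check them against the platform's command reference before use.

```
! Hypothetical values throughout; tune to the environment
! Mark a server dead after 10 seconds without a reply and 3 consecutive timeouts
radius-server dead-criteria time 10 tries 3
! Keep a dead server out of rotation for 10 minutes before retrying it
radius-server deadtime 10
!
radius server ISE-PSN1
 address ipv4 192.0.2.11 auth-port 1812 acct-port 1813
 ! Per-request timeout (seconds) and number of retransmissions
 timeout 5
 retransmit 2
 ! Periodic test authentications so server state is detected proactively
 automate-tester username radius-probe
 key <shared-secret>
```

Longer timeout and retransmit values give a slow PSN more time to answer before the port falls back to critical auth, at the cost of slower failover when a server really is down.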


5 Replies

hslai
Cisco Employee

Is your primary policy admin node (PPAN) also the primary RADIUS configured on the switch? Otherwise, it should not have impacted the authentications.

The timer for PAN failover is not configurable. It's not good to set it too short, either. If you want to make a case for an enhancement, please contact our PM team.

A switchover to the secondary PAN means promoting it to primary. So, they are the same. Or are you asking something else?

As for auto-failover of the PPAN, it's not enabled by default because an ISE admin user needs to choose which third node is used for monitoring. It's good to enable it in deployments to shorten the downtime of the service features that do not work when the PPAN is down.

As for your question on RADIUS load balancing on Cisco IOS, please see the explanation in "Demystifying RADIUS Server Configurations" (Cisco).
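To show where the load-balancing command from the question lives, here is a small sketch of a RADIUS server group. The group and server names are hypothetical, and the batch size of 25 is only a placeholder (it is, to my knowledge, the IOS default); this thread does not state a recommended number.

```
! Hypothetical group and server names; the servers must be defined
! separately with "radius server <name>"
aaa group server radius ISE-PSN-GROUP
 server name ISE-PSN1
 server name ISE-PSN2
 ! Send each batch of transactions to the server with the fewest
 ! outstanding requests; batch-size 25 is a placeholder value
 load-balance method least-outstanding batch-size 25
```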

Hi,

The PAN node is not a RADIUS server on the switch, but the customer hit the reauthentication timer at that time, and we ended up in a "RADIUS server dead" scenario.

There is a section in the admin guide explaining what does not work when the PPAN is booted or is down.

But I need to investigate what we can do to get 100 % uptime on all features, since the customer needs that, or else we need a secure workaround for their streaming devices for live radio/TV and so on.

I also need to find out what really happens when we do a switchover from the primary PAN to the secondary, and the timing involved. Will that shorten the time, and if yes, to what value?

Best regards

Tue

Since the PAN is not acting as a PSN, it should not have caused RADIUS auth outages. Please involve TAC to investigate.

If the secondary PAN is promoted to primary while the PPAN is still up, both PANs will be restarted. The restart time depends on the hardware resources, as it is bounded by CPU, memory, and I/O.

TAC is on the case, and their conclusion was that the restart caused the outages.

My main focus is still to be able to design a setup that takes this issue into account, whether that means implementing load balancers or changing switch configuration and timers.

I have simply not been able to find out how to mitigate the issue at hand, or to find others who have hit the same issue.

Therefore I'm reaching out here to see if we can suggest another configuration or setup that will provide 100 % uptime, or at least resilience against services not working when the PPAN is down or rebooted (basically all the features listed in the admin guide as not working when the PPAN is rebooted).

If the design will cost a crazy amount of money, that's OK. Then they can decide whether it's that important.

We have already come a long way with better configuration: changing to C3PL config, automatic recovery from dead RADIUS servers, and so on.

Now we are looking into different reauthentication timers for these specific groups of devices, to minimize the likelihood that we hit the scenario again.

So, does anyone have an ISE design that provides 100 % uptime, regardless of cost?

Best regards

Tue
