Head-scratcher here.
A clustered pair of 7205 controllers have been faithfully doing their job for a few years, connecting back to a remote MM through a Site-to-site VPN. They are connected using IP-and-PSK for their authentication, and there haven't been issues with that.
A few days ago, one of the controllers lost its connection to the MM. It shows as "down" in the "show switches" list, but "Up" in Airwave (indeed, the datapath sessions table shows it is communicating with Airwave and also sending some info to the MM (port 6633 (openflow) and another odd port (8822? not 8211)). The device shows as "Update Required" from MM side, and "Master Unreachable/Last Snapshot" from the controller side. I turned on disaster-recovery and then off again, to try to get it to pull down the new config - but it is stuck at ConfigID -1.
I've checked the logs, and the most relevant things I see are about heartbeat timeouts and the IKE tunnel going down due to expiration. In the logs I see that it did establish a tunnel this morning to the MM but a bit later it went down.
I checked the keys are correct, and even reentered them in the MM exactly how they are from the MD.
Any other ideas? There haven't been any material changes to the devices or their config hierarchy node over the last week, and the other controller in the cluster has been connected fine this whole time.
------------------------------
- ryh
------------------------------