The SRE team executed a planned deployment on the evening of Tuesday, Oct 18. This event triggered a security policy that forced a key rotation. The key rotation process failed and went undetected. However, the services were left in a consistent state.
The initial planned key rotation failed and went undetected during the deployment. This set the stage for an unplanned key rotation triggered by the cloud provider over 24 hours later. This key rotation process that was triggered only completed a single step in the full key protection process defined by the security policy. This left the system in an inconsistent state.
An automated alert immediately notified the SRE team once the applications failed to access their persistent storage as they had old credentials.
The failure of the key rotation security policy did not alert the SRE team that the policy had not been completed successfully. If this had happened, the issue could have been completely averted. The mechanism for alerting was misconfigured and, as such, did not trigger.
Also, an issue with the cloud provider's automatic key rotation mechanism caused a portion of the key rotation security policy to trigger and leave the system in an inconsistent state.
The team logged into the service provider and restarted the database services, which allowed the key rotation to succeed. This restored all of the services. The key rotation security policy trigger has been reconfigured to be executed manually during release operations.