An instability in our platform was detected on Sunday, which led our team to run an unplanned upgrade to remedy the situation. Unfortunately, this upgrade did not go smoothly and caused a major incident that took the platform down. The upgrade had to be reverted to restore availability.
The impacted components (and their incident windows) were:
Our post-mortem analysis identified key areas for improvement:
We plan to detail these improvements as soon as they are implemented.