Major platform outage
Incident Report for Kameleoon
Postmortem

An instability in our platform was detected on Sunday, which led our team to run an unplanned update to remedy the situation. Unfortunately this upgrade did not go smoothly and led to a major incident taking the platform down. The upgrade had to be reverted to restore availability.
The impacted components (and their incident windows) were :

  • Web application : from 12/11, 10pm CET to 13/11, 4am CET
  • JavaScript files : from 13/11, midnight to 13/11, 4am CET

Our post mortem analysis pointed out key areas of improvement :

  1. updating the Kameleoon <-> CDN configuration to cache the javascript files for 24 hours rather than 1 to 2 hours
  2. updating the change management process to better handle maintenance windows in non working hours

We plan to detail these improvements as soon as they are implemented

Posted Nov 14, 2023 - 17:42 CET

Resolved
A serious outage in our platform was detected at 2:00 CET and solved at 5:25 CET this morning. Because of it, the web application, JS delivery and data collection were impacted resulting in experiments not being displayed for your visitors.
This impacted more strongly customers with synchronous implementations of our script as it blocked the loading of their websites.
We are currently pushing a hotfix to mitigate the impact should this specific incident occur again and are carrying out a post mortem to identify key improvement areas to prevent such incidents in the future.
This report should be available by 14/11
Posted Nov 13, 2023 - 02:00 CET