We have identified the root issue and are working in the background to prevent it from recurring.
Posted 4 months ago. May 05, 2019 - 21:24 UTC
The service is operational.
The RabbitMQ cluster suffered a network partition caused by a queue that grew beyond its limit. We are investigating why the consumers of that queue died over the weekend.
In a split-brain situation, the smallest set of nodes is taken out of service. In this case that set contained the rogue queue; its backup was running on another node, but RabbitMQ reported that it was out of sync.
Restoring the service required selecting the most recent data from the nodes and force-bootstrapping the cluster.
Posted 4 months ago. May 05, 2019 - 20:00 UTC
This incident affected: Web Dashboard, Job execution & data storage, Periodic job scheduling, and Project Deployment.