Dashboard, Scheduling and Project Deployment Outage
Incident Report for Zyte
Resolved
We have identified the root cause and are working in the background to prevent this issue from recurring.
Posted May 05, 2019 - 21:24 UTC
Monitoring
The service is operational.

The RabbitMQ cluster suffered a network partition caused by a queue that grew beyond its limit. We are investigating why the consumers of that queue died over the weekend.

In a split-brain situation, the smallest set of nodes is taken out of service. In this case, that set contained the rogue queue. Its backup was running on another node, but RabbitMQ reported that the backup was out of sync.
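
The rule described above can be illustrated with a minimal sketch. It assumes a strict-majority rule similar to RabbitMQ's pause_minority partition handling mode; the report does not state which mode the cluster used, and the function name and node names below are hypothetical.

```python
def nodes_to_pause(partitions):
    """Return the nodes that a majority-based rule would take out of service.

    `partitions` is a list of sets, one set of node names per side of the
    split. This is an illustration only, not RabbitMQ's implementation.
    """
    total = sum(len(part) for part in partitions)
    paused = []
    for part in partitions:
        # A side keeps serving only if it holds a strict majority of all nodes.
        if len(part) * 2 <= total:
            paused.extend(part)
    return sorted(paused)


# Hypothetical three-node cluster split into a pair and a single node.
partitions = [{"rabbit@a", "rabbit@b"}, {"rabbit@c"}]
print(nodes_to_pause(partitions))  # ['rabbit@c']
```

If the paused side is the one hosting a queue's primary copy, as happened here, the remaining side can only serve that queue from a replica, which is why an out-of-sync replica blocks a clean recovery.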

Restoring the service required selecting the most recent data from the nodes and force-bootstrapping the cluster.
Posted May 05, 2019 - 20:00 UTC
This incident affected: Web Dashboard and Scrapy Cloud (Scrapy Cloud - Job Execution and Storage, Scrapy Cloud - Project Deployment, Scrapy Cloud - Periodic Job Scheduling).