On Thursday, June 12, from 18:02 UTC to 19:56 UTC, AuthKit, Dashboard, Admin Portal, SSO, and their APIs were unavailable for some customers. The outage lasted 1 hour and 54 minutes.
Customers can opt to serve WorkOS products from custom domains to maintain a consistent user experience. The custom domain routing infrastructure relies, in part, upon Cloudflare KV. This service was at the center of a widespread outage at Cloudflare.
We identified the issue and deployed a mitigation while the upstream outage was still ongoing. We recognize that an outage this severe is unacceptable, and we strive for full resilience in the face of inevitable outages in dependent services.
WorkOS offers the use of custom domains for a number of its products: AuthKit, SSO, Admin Portal, and its APIs. Cloudflare Workers are used to perform routing logic for these custom domains. For a subset of these custom domains, we persist a lookup table from custom domain to customer in Cloudflare KV.
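To illustrate the shape of this lookup, here is a minimal sketch of a custom-domain-to-customer resolution step. It is not WorkOS's actual Worker code: the `DomainStore` interface, the `resolveCustomer` function, and the customer ID format are hypothetical names chosen for illustration. The only assumption about Cloudflare KV is that a read by key resolves to a string or to `null` when the key is absent.

```typescript
// Hypothetical stand-in for the one KV operation this sketch needs:
// reading a value by key, which yields null when the key does not exist.
interface DomainStore {
  get(key: string): Promise<string | null>;
}

// Distinguish "this domain is not registered" from "the lookup itself failed",
// because the two cases call for different responses.
type RouteResult =
  | { kind: "found"; customerId: string }
  | { kind: "unknown-domain" }
  | { kind: "lookup-failed"; reason: string };

async function resolveCustomer(
  hostname: string,
  store: DomainStore,
): Promise<RouteResult> {
  // Hostnames are case-insensitive, so normalize before using one as a key.
  const key = hostname.trim().toLowerCase();
  try {
    const customerId = await store.get(key);
    if (customerId === null) {
      return { kind: "unknown-domain" };
    }
    return { kind: "found", customerId };
  } catch (err) {
    // The store is unreachable or erroring: the failure mode described above.
    const reason = err instanceof Error ? err.message : String(err);
    return { kind: "lookup-failed", reason };
  }
}
```

Keeping `lookup-failed` separate from `unknown-domain` means a store outage can be reported as a dependency failure instead of being mistaken for an unregistered domain.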
Due to the June 12 Cloudflare outage, these lookups failed. Because the Cloudflare dashboard was also impacted, we did not have immediate visibility into the exact cause of failures in Cloudflare’s platform. It wasn’t until Cloudflare posted details on affected services that we were able to identify the failing dependency and develop mitigations.
Cloudflare Workers and Cloudflare KV are in the critical path for our services. We are developing redundancies for outages of these dependencies.
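One common form such a redundancy can take is a secondary lookup source that is consulted only when the primary errors or is too slow. The sketch below illustrates that general pattern and nothing more; it does not describe the redundancies WorkOS is building. `LookupSource`, `withTimeout`, `lookupWithFallback`, and the 200 ms default are all hypothetical.

```typescript
// Hypothetical interface: any store that can resolve a key to a value or null.
interface LookupSource {
  get(key: string): Promise<string | null>;
}

// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

interface LookupOutcome {
  value: string | null;
  servedBy: "primary" | "secondary";
}

async function lookupWithFallback(
  key: string,
  primary: LookupSource,
  secondary: LookupSource,
  timeoutMs = 200, // illustrative default, not a recommendation
): Promise<LookupOutcome> {
  try {
    const value = await withTimeout(primary.get(key), timeoutMs);
    // A null here is a definite "not found" from a healthy primary,
    // so it is returned as-is and the secondary is not consulted.
    return { value, servedBy: "primary" };
  } catch {
    // The primary errored or timed out; try the secondary source.
    // If the secondary also fails, its error propagates to the caller.
    const value = await secondary.get(key);
    return { value, servedBy: "secondary" };
  }
}
```

The fallback only helps if the secondary source does not share the primary's failure domain, and the two stores must be kept in sync for the answers to agree.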