Summary
On 2024-07-11 at approximately 20:00 (UTC), AuthKit began generating invalid SSO and OAuth authorization URLs for customers using a custom authentication API domain. As a result, end users encountered a 404 (not found) page when attempting to sign in through AuthKit.
Root Cause Analysis
As part of the sign-in flows, AuthKit calls the WorkOS API (https://api.workos.com). Customers may sign up to use a custom API domain, which serves as an alias for the WorkOS API. On 2024-07-10, we discovered that sign-in flows for OIDC connections did not use the custom API domain correctly for customers who had configured one.
On 2024-07-11, we deployed an update to address this behavior. That update introduced a more severe issue: it generated malformed authorization URLs, which sent users to a 404 (not found) page.
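To illustrate the kind of logic involved, here is a minimal sketch of building an authorization URL against either the default WorkOS API host or a customer's custom API domain. It is not AuthKit's actual code, and it does not reproduce the specific malformation from this incident, which is not described here. The function name, the `/authorize` path, and the query parameters are hypothetical placeholders.

```python
from urllib.parse import urlencode, urlunsplit

# Default API host, taken from the URL mentioned above.
DEFAULT_API_HOST = "api.workos.com"


def build_authorization_url(path, params, custom_api_domain=None):
    """Build an HTTPS authorization URL (hypothetical helper).

    `custom_api_domain` is assumed to be a bare hostname such as
    "auth.example.com". Anything containing a scheme or a path is rejected
    so that a bad value fails loudly instead of yielding a malformed URL.
    """
    host = custom_api_domain or DEFAULT_API_HOST
    if "://" in host or "/" in host:
        raise ValueError(f"expected a bare hostname, got {host!r}")
    if not path.startswith("/"):
        path = "/" + path
    # Assemble the URL from its components rather than by string
    # concatenation, so separators and query encoding stay consistent.
    return urlunsplit(("https", host, path, urlencode(params), ""))
```

For example, `build_authorization_url("/authorize", {"client_id": "abc"}, "auth.example.com")` yields a URL on the custom domain, and omitting the last argument falls back to the default host.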
Actions and Remediations
- Remediation: Upon learning of the issue, we rolled back the faulty deploy, which resolved the problem. Although the rollback was quick once initiated, identifying the issue took significantly longer than is acceptable, resulting in ~32 minutes of unavailability for customers with custom API domains.
- Permanent fix: The correct fix for the original issue related to custom API domains was deployed a few hours later.
- Improving Test Coverage: We have identified several areas for improvement in our test suite. We are adding new integration and end-to-end tests, including scenarios for customers with custom API domains configured, which should help prevent similar issues.
- Enhancing Monitoring: We are upgrading our monitoring tools to better detect request anomalies and perform more sophisticated automated checks that simulate end-user behavior in critical sign-in flows. This should reduce the time to detect issues.
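The test and monitoring items above could take roughly the following shape: a check that inspects a generated sign-in URL and then requests it, reporting a problem if the host is wrong or the response is a 404. This is a sketch under stated assumptions, not a description of the tooling WorkOS uses. The function names are hypothetical, and the status fetcher is injectable so the same check can run in a unit test without network access.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def http_status(url, timeout=10.0):
    """Return the HTTP status for a GET request (follows redirects)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status is on the exception.
        return err.code


def check_sign_in_url(url, expected_host, fetch_status=http_status):
    """Return a list of problems found with a sign-in URL (empty if none)."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme != "https":
        problems.append(f"unexpected scheme {parts.scheme!r}")
    if parts.hostname != expected_host:
        problems.append(f"unexpected host {parts.hostname!r}")
    # Only hit the network when the URL is structurally sound.
    if not problems:
        status = fetch_status(url)
        if status == 404:
            problems.append(f"unexpected HTTP status {status}")
    return problems
```

In a test suite, `fetch_status` can be replaced with a stub such as `lambda url: 404` to confirm that the check flags a not-found response. Run on a schedule against a test tenant with a custom API domain, the same function could serve as a simple synthetic probe.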
Timeline
- 2024-07-11 20:01 (UTC): faulty code deployed
- 2024-07-11 20:28 (UTC): issue was acknowledged by WorkOS team
- 2024-07-11 20:33 (UTC): rollback deployed
- 2024-07-11 20:47 (UTC): last NotFound error seen (~2% of the total errors occurred between 20:33 and 20:47, which is attributed to the time needed to invalidate the cached page instances at the edge)
- 2024-07-12 00:31 (UTC): permanent fix was deployed
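The intervals implied by the timeline can be worked out directly from the timestamps above. The short sketch below does that arithmetic. The dictionary keys are labels chosen for illustration, not terms from the incident record.

```python
from datetime import datetime, timezone


def utc(text):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp as UTC."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline.
events = {
    "deployed": utc("2024-07-11 20:01"),
    "acknowledged": utc("2024-07-11 20:28"),
    "rolled_back": utc("2024-07-11 20:33"),
    "last_error": utc("2024-07-11 20:47"),
    "permanent_fix": utc("2024-07-12 00:31"),
}


def minutes_between(start, end):
    """Whole minutes elapsed between two named events."""
    return int((events[end] - events[start]).total_seconds() // 60)


time_to_acknowledge = minutes_between("deployed", "acknowledged")  # 27
time_to_roll_back = minutes_between("acknowledged", "rolled_back")  # 5
outage_duration = minutes_between("deployed", "rolled_back")        # 32
cache_tail = minutes_between("rolled_back", "last_error")           # 14
```

Detection accounts for 27 of the 32 minutes between the faulty deploy and the rollback, which is why the monitoring work focuses on reducing time to detect.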