Elevated Platform Errors

Incident Report for WorkOS

Postmortem

Date: August 21, 2025

Duration: 17:26 - 19:27 UTC

Status: Resolved

Summary

On August 21, 2025, our platform experienced elevated errors and high latency across several services, including sign-in and API requests. The disruption was caused by network link saturation between one of our upstream providers and our cloud hosting environment. The same provider issue caused a widespread disruption of internet traffic.

During the incident, some customers encountered difficulties logging in with username/password or SSO, along with delays and intermittent errors in the API, Dashboard, Admin Portal, and Directory Sync. While the extent varied over time, a meaningful portion of requests were degraded or failed entirely.

What Happened

The disruption stemmed from connectivity issues between our upstream network provider and our cloud infrastructure provider. This caused elevated latency, timeouts, and service unavailability until traffic conditions were stabilized.

Timeline

17:26 UTC - Elevated errors observed and internal incident created

17:47 UTC - Root cause identified as a network issue with our upstream provider

17:52 UTC - Team begins work on mitigation of secondary impacts due to increased number of open requests, while monitoring network conditions

19:27 UTC - Problem with upstream network provider is resolved and response times return to normal
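
The intervals implied by the timeline can be worked out directly from the timestamps above. The following is a minimal sketch; the helper names (`utc`, `minutes_between`) are illustrative and not part of any WorkOS tooling.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry on the incident date (August 21, 2025) as UTC."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2025, 8, 21, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline entries."""
    return int((utc(end) - utc(start)).total_seconds() // 60)


# Timestamps are taken from the timeline entries above.
time_to_root_cause = minutes_between("17:26", "17:47")
time_to_mitigation_start = minutes_between("17:26", "17:52")
total_duration = minutes_between("17:26", "19:27")

print(f"Root cause identified after {time_to_root_cause} min")
print(f"Mitigation work began after {time_to_mitigation_start} min")
print(f"Total duration: {total_duration} min")
```

By this arithmetic, the root cause was identified 21 minutes after detection, and the incident lasted 121 minutes (2 hours 1 minute) in total.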

Remediation

Our engineering team investigated multiple paths to remediation while closely monitoring the upstream provider’s recovery efforts. Service performance returned to normal once network stability was restored, after which we transitioned to monitoring to ensure ongoing stability.

To improve resilience and reduce the likelihood of similar issues in the future, we are:

  • Improving observability at the network boundary between our infrastructure and upstream providers.
  • Expanding our monitoring and alerting to more quickly detect issues with our upstream network.
  • Evaluating architectural changes, including multi-region deployment and reducing reliance on single external providers, to further enhance reliability.
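
As an illustration of the kind of upstream monitoring described in the list above, the sketch below keeps a sliding window of probe results and flags when the error rate or 95th-percentile latency crosses a threshold. It is not our production implementation: the class names, window size, and thresholds are hypothetical values chosen for the example, and the probes themselves (for instance, periodic requests across the network boundary) are assumed to be collected elsewhere.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one hypothetical probe across the network boundary."""
    ok: bool
    latency_ms: float


class UpstreamHealthWindow:
    """Sliding window over recent probes; thresholds are illustrative only."""

    def __init__(self, size: int = 20, max_error_rate: float = 0.2,
                 max_p95_latency_ms: float = 500.0) -> None:
        self._results: deque = deque(maxlen=size)
        self._size = size
        self._max_error_rate = max_error_rate
        self._max_p95_latency_ms = max_p95_latency_ms

    def record(self, result: ProbeResult) -> None:
        # The deque drops the oldest probe once the window is full.
        self._results.append(result)

    def error_rate(self) -> float:
        if not self._results:
            return 0.0
        failures = sum(1 for r in self._results if not r.ok)
        return failures / len(self._results)

    def p95_latency_ms(self) -> float:
        # Nearest-rank percentile over successful probes only.
        # Returns 0.0 when there are none; the error rate covers that case.
        latencies = sorted(r.latency_ms for r in self._results if r.ok)
        if not latencies:
            return 0.0
        rank = (95 * len(latencies) + 99) // 100  # ceil(0.95 * n) in integer math
        return latencies[rank - 1]

    def should_alert(self) -> bool:
        # Wait for a full window so a single early failure does not page anyone.
        if len(self._results) < self._size:
            return False
        return (self.error_rate() > self._max_error_rate
                or self.p95_latency_ms() > self._max_p95_latency_ms)


# Usage: five healthy probes, then two failures push the error rate to 40%.
window = UpstreamHealthWindow(size=5)
for _ in range(5):
    window.record(ProbeResult(ok=True, latency_ms=50.0))
healthy_alert = window.should_alert()

window.record(ProbeResult(ok=False, latency_ms=0.0))
window.record(ProbeResult(ok=False, latency_ms=0.0))
degraded_alert = window.should_alert()
```

Requiring a full window before alerting trades a little detection speed for fewer false alarms; a real deployment would tune the window size and thresholds against observed baseline traffic.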

Conclusion

We regret the significant impact this incident had on you and your customers. We are committed to implementing lasting improvements to ensure greater stability going forward, and will keep you updated on our progress.

Posted Aug 26, 2025 - 19:44 EDT

Resolved

The incident has been resolved.
Posted Aug 21, 2025 - 16:02 EDT

Monitoring

Our systems are returning to normal. We’ll continue monitoring closely with our upstream provider.
Posted Aug 21, 2025 - 15:29 EDT

Update

We're continuing to investigate on our side, but our edge networking provider has confirmed problems in the US East region. We’re working with them for updates as they resolve it.
Posted Aug 21, 2025 - 15:11 EDT

Update

Our platform is experiencing elevated errors and latency platform-wide. We're still investigating the issue, and will provide an update soon.
Posted Aug 21, 2025 - 13:59 EDT

Investigating

AuthKit is experiencing elevated platform errors. We're currently investigating the issue, and will provide an update soon.
Posted Aug 21, 2025 - 13:39 EDT
This incident affected: Supporting Services (Dashboard, Admin Portal) and Core Services (SSO, Directory Sync, AuthKit).