Elevated Webhook Delays

Incident Report for WorkOS

Postmortem

Summary

On September 10, 2025, Directory Sync webhook delivery was significantly delayed. Customers experienced these delays from 14:54 UTC until resolution at 17:46 UTC. The issue was triggered when a high volume of outgoing webhooks timed out during delivery. The slow failures and repeated delivery attempts consumed the available webhook worker capacity.

What Happened

During the incident, a customer generating a high volume of Directory Sync webhooks experienced an outage on their system. This led to timeout errors when we attempted to deliver webhooks to them. These timeouts created a growing backlog, which eventually caused delayed webhook delivery for all customers.

During this period, 26 customers experienced delays exceeding 10 minutes for at least 10% of their webhook volume. The 90th percentile delay across the system reached 49 minutes.
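For readers who want to apply the same impact criteria to their own delivery logs, the two measures above can be sketched as follows. This is a minimal illustration, not the tooling used for this report: the function names, the input shape (a list of per-webhook delays in minutes), and the nearest-rank percentile method are all assumptions.

```python
import math
from typing import List


def is_impacted(delays_min: List[float],
                threshold_min: float = 10.0,
                min_share: float = 0.10) -> bool:
    """True if at least `min_share` of deliveries were delayed more than `threshold_min`.

    Hypothetical helper: the defaults mirror the 10-minute / 10% criteria in the text.
    """
    if not delays_min:
        return False
    slow = sum(1 for d in delays_min if d > threshold_min)
    return slow / len(delays_min) >= min_share


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Nearest-rank percentile (an assumed method; the report does not state one)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]
```

A customer with one delivery out of ten delayed by more than 10 minutes would count as impacted under these assumed defaults, while one out of twenty would not.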

Contributing Factors

  • A high volume of outgoing webhooks was timing out on delivery to a customer system.
  • Timeouts led to slow failures and repeated delivery attempts, consuming the available capacity of the webhook workers.
  • A shared pool of worker capacity for Directory Sync webhook delivery led to a cascading effect that impacted all Directory Sync webhooks.
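The way slow failures drain a shared pool can be illustrated with a back-of-the-envelope model. The sketch below is illustrative only: the worker count, the delivery time, and the timeout are hypothetical numbers, not measurements from this incident, and the model ignores retries, which would make the degradation worse.

```python
def pool_throughput(workers: int,
                    fast_s: float,
                    timeout_s: float,
                    timeout_fraction: float) -> float:
    """Deliveries per second a shared worker pool can sustain.

    Each worker handles one delivery at a time, so throughput is the
    worker count divided by the mean time a delivery occupies a worker.
    """
    mean_service_s = (1 - timeout_fraction) * fast_s + timeout_fraction * timeout_s
    return workers / mean_service_s


# Hypothetical figures: 100 workers, 0.2 s per healthy delivery, 30 s timeout.
healthy = pool_throughput(workers=100, fast_s=0.2, timeout_s=30.0, timeout_fraction=0.0)
degraded = pool_throughput(workers=100, fast_s=0.2, timeout_s=30.0, timeout_fraction=0.5)

print(f"healthy:  {healthy:.1f} deliveries/s")
print(f"degraded: {degraded:.1f} deliveries/s")
```

Under these assumed numbers, a pool where half of the attempts time out sustains only a small fraction of its healthy throughput, which is why one unresponsive destination can delay deliveries to every other customer that shares the pool.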

Timeline

Time (UTC)  Event
14:54       Directory Sync webhooks began experiencing significant delivery delays.
16:50       First alert fired.
17:19       Investigation began; source of delivery failures identified.
17:46       Normal delivery returned.
17:48       Job backlog cleared; customer system recovered.

Remediation

Immediate Actions

  • Investigated and identified the source of delivery failures.
  • Worked to clear the job backlog.
  • Monitored the customer system recovery and webhook delivery.

Future Prevention & Improvements

  • New Webhook Delivery Architecture: We are testing Directory Sync webhooks on a new architecture that will isolate delivery failures, ensuring that issues with one customer do not impact others.
  • Load Shedding Techniques: We have identified, and will implement, load shedding techniques to manage high volumes of webhooks more effectively and prevent system overload.
  • Improved Alerting: We are developing enhanced alerting to allow for faster detection of, and response to, degraded webhook performance.
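As one illustration of how failure isolation and load shedding can work together, the sketch below shows a per-endpoint circuit breaker: after a run of consecutive failures to one destination, new deliveries to that destination are deferred for a cool-down period instead of occupying workers until they time out. This is a generic pattern under stated assumptions, not a description of the new architecture. The class name, thresholds, and method names are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class EndpointBreaker:
    """Hypothetical per-endpoint circuit breaker for webhook delivery."""
    failure_threshold: int = 5       # consecutive failures before shedding (assumed value)
    cooldown_s: float = 60.0         # how long to shed after tripping (assumed value)
    clock: Callable[[], float] = time.monotonic
    _failures: Dict[str, int] = field(default_factory=dict)
    _open_until: Dict[str, float] = field(default_factory=dict)

    def allow(self, endpoint: str) -> bool:
        """False while deliveries to this endpoint are being shed."""
        return self.clock() >= self._open_until.get(endpoint, float("-inf"))

    def record_success(self, endpoint: str) -> None:
        """A successful delivery resets the endpoint to healthy."""
        self._failures.pop(endpoint, None)
        self._open_until.pop(endpoint, None)

    def record_failure(self, endpoint: str) -> None:
        """Count a failure; trip the breaker once the threshold is reached."""
        count = self._failures.get(endpoint, 0) + 1
        self._failures[endpoint] = count
        if count >= self.failure_threshold:
            self._open_until[endpoint] = self.clock() + self.cooldown_s
```

A worker would call `allow(endpoint)` before attempting a delivery and requeue the job for later when it returns `False`. Because state is tracked per endpoint, a failing destination is shed without affecting deliveries to healthy ones. After the cool-down, the next attempt acts as a probe: a success resets the endpoint, and a further failure trips the breaker again immediately.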

Conclusion

We understand that this incident had a noticeable impact on our customers and their clients. As a provider of essential enterprise functionality, we strive to provide the best possible service, and will be feeding the lessons learned from this incident back into our system. This ensures a resilient platform you can rely on now and in the future.

Posted Sep 17, 2025 - 11:13 EDT

Resolved

This incident has been resolved.
Posted Sep 10, 2025 - 16:53 EDT

Monitoring

Delays on Directory Sync webhooks have returned to baseline levels. We are continuing to monitor.
Posted Sep 10, 2025 - 13:50 EDT

Identified

We are investigating an issue with our webhooks, which has led to elevated latency for Directory Sync webhook delivery.

We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 10, 2025 - 13:39 EDT