Webhook events not delivered fast enough

Hi all,

Sadly we seemed to experience a load issue today. We noticed “load shedding” responses, and upon checking, a backlog of webhooks has built up.

At the same time, I also notice there aren’t even that many requests coming in to our internal API. It seems to be stuck at 20-30 msgs/sec and sometimes even close to 0.

```mx_rollup = false
connection_limit = 100
max_deliveries_per_connection = 100000
max_connection_rate = "1000/s"
max_message_rate = "200/s"```

```local log_hooks = require 'policy-extras.log_hooks'
log_hooks:new_json {
  name = 'webhook',
  url = 'http://api/xxxx',
  log_parameters = {
    headers = { 'Subject', 'xxxx', 'xxxx', 'xxxxx', 'xxxxx' },
    meta = { 'subject', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'x-original-message-id'}
  },
}```

Does anyone have any recommendations on what went wrong here? The total backlog sum is 2 million atm 😅
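For a rough sense of scale, here is a back-of-the-envelope sketch in Python (not part of the KumoMTA config) of how long a 2 million record backlog takes to drain at the rates mentioned in this thread. It assumes a constant rate and no new events arriving, so real numbers will be worse.

```python
def drain_time_hours(backlog: int, rate_per_sec: float) -> float:
    """Hours needed to drain `backlog` records at a constant rate, ignoring new arrivals."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    return backlog / rate_per_sec / 3600


BACKLOG = 2_000_000  # total backlog reported in this thread

# 20 and 30 are the observed rates; 200 is the configured max_message_rate.
for rate in (20, 30, 200):
    print(f"{rate:>3}/s -> {drain_time_hours(BACKLOG, rate):.1f} h")
```

At the observed 20-30 msgs/sec the backlog takes most of a day to clear, while the configured 200/s would clear it in under three hours.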

Enabling debug logs shows me a lot of messages like this:

Ah, I just figured out that the webhook queue was so big it was delaying the messages (hence that message and this one):
Context: DueTimeWasReached, ReadyQueueWasFull. Next due in 4s 999ms 998us 30ns at 2025-11-02T06:19:51.222420297Z

This compounded the number of webhooks in the backlog. I forgot to disable Delayed webhook events for this endpoint.

Is it possible to clear only this specific event from the queue somehow?

```log_hooks:new {
  log_parameters = {
    per_record = {
      Delayed = { enable = false },
      Any = { enable = true },
    },
    ...
  }
  ...
}```

I did set that up indeed, however it doesn’t clear the existing queue

Regarding processing speed:

Setting max_ready for webhooks in shaping.toml doesn’t seem to affect anything, so I am probably missing something. I assume that is the right setting to change to get things flowing a bit faster, considering that once the ready queue is full it waits up to 60 seconds before filling up again.

Even though max_message_rate is set to 200/s, it caps at around 20/s.

Maybe max_age… would work, or would it bounce all the messages in this queue?

It’s 2 more hours until the queue has cleared, so I’ll wait it out. Too much risk of non-Delayed webhooks being purged that way haha

It looks like you are using a single record per hook. You may want to consider batching those logs for better performance.

Ah yea! Forgot to update this thread, but we indeed did just that. Queue was cleared almost immediately. Works like a charm :grinning_face_with_smiling_eyes:
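To illustrate why batching makes such a difference: if the endpoint keeps handling roughly the same number of HTTP requests per second, each request now carries many records instead of one. A rough Python sketch, where 20 requests/sec comes from the rate seen earlier in the thread and the batch size of 500 is an assumed value (the thread does not say which batch size was used for this hook):

```python
def drain_seconds(backlog: int, requests_per_sec: float, batch_size: int = 1) -> float:
    """Seconds to drain `backlog` records when every request carries `batch_size` records."""
    if requests_per_sec <= 0 or batch_size < 1:
        raise ValueError("requests_per_sec must be > 0 and batch_size >= 1")
    return backlog / (requests_per_sec * batch_size)


BACKLOG = 2_000_000       # backlog reported in this thread
REQUESTS_PER_SEC = 20     # observed request rate against the internal API
ASSUMED_BATCH_SIZE = 500  # assumption, for illustration only

unbatched = drain_seconds(BACKLOG, REQUESTS_PER_SEC)
batched = drain_seconds(BACKLOG, REQUESTS_PER_SEC, ASSUMED_BATCH_SIZE)
print(f"one record per request: {unbatched / 3600:.1f} h")
print(f"{ASSUMED_BATCH_SIZE} records per request: {batched / 60:.1f} min")
```

The request rate is unlikely to stay exactly the same once requests get bigger, so treat this as an order-of-magnitude estimate only.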

Awesome

I have created this file for batch processing:
https://pastebin.com/Lkz5KNf2
The issue is that email_specter does not accept array data at its webhook endpoint; it is built to process a single record per request. Has anyone tried it with batching?
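On the receiving side, one general way to cope with this is to normalize the request body before processing, so the same handler works for both single-record and batched hooks. A minimal Python sketch, assuming the batched hook posts a JSON array of record objects and the unbatched hook posts one JSON object (that payload shape is an assumption here, and `records_from_body` is a hypothetical helper, not part of email_specter or KumoMTA):

```python
import json
from typing import Any, Dict, List, Union


def records_from_body(body: Union[bytes, str]) -> List[Dict[str, Any]]:
    """Return a list of log records from a webhook body.

    Accepts either a single JSON object or a JSON array of objects.
    """
    payload = json.loads(body)
    if isinstance(payload, dict):
        return [payload]  # unbatched hook: wrap the single record
    if isinstance(payload, list):
        if not all(isinstance(item, dict) for item in payload):
            raise ValueError("every element of a batch must be a JSON object")
        return payload
    raise ValueError("expected a JSON object or an array of JSON objects")


# Both shapes end up as a list the existing per-record logic can loop over.
single = records_from_body('{"type": "Delivery", "id": "a"}')
batch = records_from_body('[{"type": "Delivery", "id": "a"}, {"type": "Bounce", "id": "b"}]')
print(len(single), len(batch))
```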

@liberated-alligator This sounds like a new issue, can you please create a new request for this in <#1167076017903501445> ?

To quickly come back to this, we have batched webhooks to Postmastery for example, however, the ready queue is constant at 1042 (it’s basically flat) and the scheduled queue depth keeps increasing.

It’s weirdly enough only happening for 2 MTAs of the 3 we run atm. What could be the culprit in this case?

For example we have:

```mx_rollup = false
connection_limit = 1
max_deliveries_per_connection = 100000
max_connection_rate = "1000/s"```

```log_hooks:new {
  name = 'postmastery-webhook',
  batch_size = 500,
  min_batch_size = 100,
  max_batch_latency = "60s",```

```['default']
connection_limit = 2
max_connection_rate = "100/min"
max_deliveries_per_connection = 50
max_message_rate = "100/s"
idle_timeout = "60s"
enable_tls = "OpportunisticInsecure"
skip_hosts = ["::/0"]
remember_broken_tls = "3 days"
opportunistic_tls_reconnect_on_failed_handshake = true
consecutive_connection_failures_before_delay = 10
try_next_host_on_transport_error = true```

The max_ready default is 1024 and you have not modified it in any of the configs.
This is a good setting for a typical email, but logs are pretty tiny (1kb?) so allowing for many more in the ready queue is ok.
You may want to set the max_ready for the Postmastery webhook to something higher. Try 10,000 to start and see if it makes a difference. Increase it experimentally, but I have seen 40k work for me in the past. Note that I would NOT do that for a typical email domain.
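As a rough illustration of why a larger ready queue is cheap for log records, here is a small Python estimate of the payload held in a full ready queue. It uses the "1kb?" guess above as the record size and counts payload bytes only; per-message overhead inside the MTA is not included, so the real footprint will be higher.

```python
def ready_queue_payload_bytes(max_ready: int, avg_record_bytes: int) -> int:
    """Payload bytes held when the ready queue is full (payload only, no overhead)."""
    return max_ready * avg_record_bytes


LOG_RECORD_BYTES = 1024  # the "1kb?" guess from this thread, not a measured value

# Default, suggested starting point, and the upper value mentioned above.
for max_ready in (1024, 10_000, 40_000):
    mib = ready_queue_payload_bytes(max_ready, LOG_RECORD_BYTES) / 2**20
    print(f"max_ready={max_ready:>6}: ~{mib:.1f} MiB of payload")
```

Even at 40k records this stays in the tens of MiB, whereas the same count of full-size emails would hold far more data per queue, which is one reason not to do this for a typical email domain.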

Check, thanks Tom! I wasn’t really aware it was ok to change, because like you mentioned with email queues, that might do more harm than good

Trying to understand the ready queue better here. We have seen higher volumes for the past few days, all fine. However, sometimes an email takes 5 minutes between Reception and Delivery even though there’s no TransientFailure.

I don’t think the max_ready is the issue when looking at the charts (though I am not sure if we should raise it if it was the case?).

I have an example of a Delivery log (removed some values):

```{
  "type": "Delivery",
  "id": "9e618794c62111f08f7706a713957eb1",
  "queue": "xxx-0:xxx@gmail.com",
  "site": "t-ip-2->(alt1|alt2|alt3|alt4)?.gmail-smtp-in.l.google.com@smtp_client",
  "size": 3791,
  "response": {
    "code": 250,
  },
  "peer_address": {
    "name": "gmail-smtp-in.l.google.com.",
    "addr": "66.102.1.26"
  },
  "event_time": "2025-11-20T15:05:43.540837407Z",
  "created_time": "2025-11-20T15:00:14.189762Z",
  "num_attempts": 0,
  "feedback_report": null,
  "meta": {},
  "headers": {},
  "provider_name": "google",
}```

I am not really sure where I should look to investigate where the 5 mins between created_time and event_time (of delivery) comes from.

When investigating things like this, what should catch my attention most?
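To put an exact number on the gap, the two timestamps in the record above can be subtracted directly. A minimal Python sketch (the helper names are made up for illustration; the nanosecond fraction in event_time is truncated to microseconds because that is the resolution of Python's datetime):

```python
import re
from datetime import datetime, timedelta, timezone

# Matches timestamps like 2025-11-20T15:05:43.540837407Z (fraction optional).
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def parse_utc(ts: str) -> datetime:
    """Parse a UTC timestamp, truncating the fraction to microseconds."""
    m = _TS.match(ts)
    if not m:
        raise ValueError(f"unsupported timestamp: {ts!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    frac = (m.group(2) or "")[:6].ljust(6, "0")
    return base + timedelta(microseconds=int(frac))


def delivery_latency(record: dict) -> timedelta:
    """Time between message creation and the logged event."""
    return parse_utc(record["event_time"]) - parse_utc(record["created_time"])


record = {
    "created_time": "2025-11-20T15:00:14.189762Z",
    "event_time": "2025-11-20T15:05:43.540837407Z",
}
print(delivery_latency(record))  # 0:05:29.351075
```

So the message waited about 5 minutes 29 seconds, and with num_attempts at 0 that time was spent before the first delivery attempt rather than in retries.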

There may be some useful tools here for calculating max_ready:

I would take a look at kcli top to monitor latency and load. If it is taking a long time to go from injection to first delivery, then your max_ready is likely full.

What are all the log entries for that ID?