Sadly we seemed to experience a load issue today. We noticed “load shedding” responses, and upon checking, a backlog of webhooks has built up.
At the same time, I also notice there aren’t even that many requests coming in to our internal API. It seems to be stuck at 20-30 msgs/sec and sometimes even close to 0.
```
mx_rollup = false
connection_limit = 100
max_deliveries_per_connection = 100000
max_connection_rate = "1000/s"
max_message_rate = "200/s"
```
```
local log_hooks = require 'policy-extras.log_hooks'
log_hooks:new_json {
  name = 'webhook',
  url = 'http://api/xxxx',
  log_parameters = {
    headers = { 'Subject', 'xxxx', 'xxxx', 'xxxxx', 'xxxxx' },
    meta = { 'subject', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'x-original-message-id' },
  },
}
```
Does anyone have any recommendations on what went wrong here? The total backlog sum is 2 million atm 😅
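For a rough sense of scale, here is a back-of-envelope sketch of how long a backlog of that size takes to drain at the rates mentioned above. It assumes nothing new is enqueued while draining, which is optimistic, and the helper name `drain_hours` is purely illustrative.

```python
def drain_hours(backlog: int, rate_per_sec: float) -> float:
    """Hours needed to drain `backlog` messages at a constant rate."""
    if rate_per_sec <= 0:
        raise ValueError("rate must be positive")
    return backlog / rate_per_sec / 3600


BACKLOG = 2_000_000  # total backlog reported above

# 20 and 30 msgs/sec are the observed rates; 200/s is the configured max_message_rate.
for rate in (20, 30, 200):
    print(f"{rate:>3} msgs/sec -> {drain_hours(BACKLOG, rate):.1f} h")
```

At the observed 20-30 msgs/sec that works out to roughly 18 to 28 hours, versus under 3 hours at the configured 200/s.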
Ah, just figured out that the webhook queue was so big it was delaying the messages (hence that message and this one: Context: DueTimeWasReached, ReadyQueueWasFull. Next due in 4s 999ms 998us 30ns at 2025-11-02T06:19:51.222420297Z).
This compounded the number of webhooks in the backlog. I forgot to disable Delayed webhook events for this endpoint.
Is it possible to clear only this specific event from the queue somehow?
Setting max_ready for webhooks in the shaping.toml doesn’t seem to affect anything, so I am probably missing something. I am assuming that is the right setting to change to get it to flow a bit faster, considering that once the ready_queue is full it waits up to 60 seconds before filling it up again.
Even though max_message_rate is set to 200/s, it caps at around 20/s.
I have created this file for batch processing: https://pastebin.com/Lkz5KNf2
The issue is that the email_specter webhook endpoint does not accept array data; it is built to process single events. Has anyone tried it for batch processing?
To quickly come back to this, we have batched webhooks to Postmastery for example, however, the ready queue is constant at 1042 (it’s basically flat) and the scheduled queue depth keeps increasing.
It’s weirdly enough only happening for 2 MTAs of the 3 we run atm. What could be the culprit in this case?
The max_ready default is 1024 and you have not modified it in any of the configs.
This is a good setting for a typical email, but logs are pretty tiny (1kb?) so allowing for many more in the ready queue is ok.
You may want to set the max_ready for the Postmastery webhook to something higher. Try 10,000 to start and see if it makes a difference. Increase it experimentally, but I have seen 40k work for me in the past. Note that I would NOT do that for a typical email domain.
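To see why a larger max_ready could matter here, a rough sketch: if a full ready queue really is only refilled about once every 60 seconds (as described earlier in this thread, not something verified against the KumoMTA source), then throughput is bounded by max_ready divided by that interval, and by max_message_rate. The helper name and the 60 second interval are assumptions for illustration; this is a simplified model, not KumoMTA's actual scheduling logic.

```python
def throughput_ceiling(max_ready: int, refill_interval_s: float, max_rate: float) -> float:
    """Upper bound on msgs/sec if a full ready queue is drained once per refill interval."""
    return min(max_ready / refill_interval_s, max_rate)


REFILL_INTERVAL_S = 60  # assumption: worst-case wait reported in this thread
MAX_MESSAGE_RATE = 200  # from max_message_rate = "200/s"

for max_ready in (1024, 10_000, 40_000):
    ceiling = throughput_ceiling(max_ready, REFILL_INTERVAL_S, MAX_MESSAGE_RATE)
    print(f"max_ready={max_ready:>6} -> at most ~{ceiling:.0f} msgs/sec")
```

Under that model the default of 1024 comes out at about 17 msgs/sec, which is in the same ballpark as the ~20/s cap reported above, while 10,000 would allow roughly 167 msgs/sec before max_message_rate becomes the limit.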
Trying to understand the ready queue better here. Basically, we have seen higher volumes for a few days now, which has been fine. However, sometimes an email takes 5 minutes between Reception and Delivery, even though there’s no TransientFailure.
I don’t think the max_ready is the issue when looking at the charts (though I am not sure if we should raise it if it was the case?).
I have an example of a Delivery log (removed some values):
```
{
"type": "Delivery",
"id": "9e618794c62111f08f7706a713957eb1",
"queue": "xxx-0:xxx@gmail.com",
"site": "t-ip-2->(alt1|alt2|alt3|alt4)?.gmail-smtp-in.l.google.com@smtp_client",
"size": 3791,
"response": {
"code": 250,
},
"peer_address": {
"name": "gmail-smtp-in.l.google.com.",
"addr": "66.102.1.26"
},
"event_time": "2025-11-20T15:05:43.540837407Z",
"created_time": "2025-11-20T15:00:14.189762Z",
"num_attempts": 0,
"feedback_report": null,
"meta": {},
"headers": {},
"provider_name": "google",
}```
I am not really sure where I should look to investigate where the 5 mins between created_time and event_time (of delivery) comes from.
When investigating things like this, what should catch my attention most?
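For reference, the gap can be computed directly from the two timestamps in that record. A minimal Python sketch (the helper `parse_ts` is my own; it trims the nanosecond fraction to microseconds because `datetime` does not keep more precision than that):

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a UTC timestamp like the ones in the log, trimming to microseconds."""
    body = ts.rstrip("Z")
    if "." in body:
        whole, frac = body.split(".", 1)
        body = f"{whole}.{frac[:6].ljust(6, '0')}"
        fmt = "%Y-%m-%dT%H:%M:%S.%f"
    else:
        fmt = "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(body, fmt).replace(tzinfo=timezone.utc)


created = parse_ts("2025-11-20T15:00:14.189762Z")
delivered = parse_ts("2025-11-20T15:05:43.540837407Z")

delay = (delivered - created).total_seconds()
print(f"{delay:.3f} s between created_time and event_time")
```

That comes to about 329 seconds (5 min 29 s). Since num_attempts is 0 in the record, the time was presumably spent waiting before the first attempt rather than in retries.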
I would take a look at kcli top to monitor latency and load. If it is taking a long time to go from injection to first delivery, then your max_ready is likely full.