Webhook events not delivered fast enough

tom · November 21, 2025, 4:32pm

Have you fixed DNS yet?

Bjarn · November 21, 2025, 4:34pm

Yep, it’s using the node-local DNS resolver now, without Unbound:

```
    name_servers = {
      '169.254.20.10:53',
    }
})
```

Areeb · November 21, 2025, 11:02pm

This means that each IP address can have up to 2 simultaneous SMTP connections to that destination. Yes?

wez · November 22, 2025, 6:43am

yes, but the connection rate still limits the throughput as mentioned above

Bjarn · November 22, 2025, 10:18am

Can that really be the bottleneck at our volume? Those defaults should be overridden by:

```
max_deliveries_per_connection = 30
provider_connection_limit = 10
max_message_rate = "200/s"
max_connection_rate = "200/min"
max_ready = 2048
```

And those settings were already in place when the delays in those logs happened.

If I calculate correctly, it’s 300*200/min. That is far, far above our volume.

It doesn’t happen when traffic is low, no issues last night for example.

wez · November 22, 2025, 2:54pm

TBH, I’m not really following this thread very closely. I mentioned the above as something you need to consider for your overall configuration, as seemingly strange delays can just be that the system is enforcing the constraints you specified.

Bjarn · November 22, 2025, 5:39pm

Yea no worries at all! Appreciate all the help I’m getting while searching

Bjarn · November 22, 2025, 5:41pm

If I understand that correctly it means that if the configuration forces a delay, it won’t necessarily cause a Delayed event (and thus not always write to spool)?

Also, scheduled queue is not necessarily related to writing to spool right?

wez · November 22, 2025, 7:04pm

If messages are throttled in the ready queue due to shaping constraints, and the ready queue is not full, then they can sit in the ready queue until the throttle opens up and allows the queue maintainer to make progress, and they won’t generate a Delayed event. The Delayed event is logged when moving messages back to the Scheduled queue, either because the ready queue is full or because a transient response was returned from the destination.

Spool is written to only once during reception, but if you’ve implemented lua events that modify the message or its metadata, you will trigger writes to the spool when the message is next moved to the Scheduled queue.

wez · November 22, 2025, 7:05pm

we try to be smart about moving messages back to the Scheduled queue if the throttle is long enough to warrant it, so that things aren’t silently lingering

wez · November 22, 2025, 7:06pm

I’d suggest looking at delayed_due_to_ready_queue_full, delayed_due_to_message_rate_throttle and delayed_due_to_throttle_insert_ready metrics. Ideally those are all zero, but they might be non-zero in your case, and suggest what to look at next
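
A rough way to check those three counters is to filter the Prometheus-style metrics text for them. This is only a sketch: the `http://127.0.0.1:8000/metrics` URL is an assumption about where your HTTP listener is bound, the helper names are made up for illustration, and the parser assumes samples carry no trailing timestamp.

```python
import urllib.request

# Metric names taken from the post above.
WATCHED = (
    "delayed_due_to_ready_queue_full",
    "delayed_due_to_message_rate_throttle",
    "delayed_due_to_throttle_insert_ready",
)


def nonzero_delay_metrics(metrics_text):
    """Return {sample name (with labels): value} for watched samples above zero.

    Assumes each sample line is "<name>{labels} <value>" with no timestamp.
    """
    found = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if not name_and_labels.startswith(WATCHED):
            continue
        try:
            number = float(value)
        except ValueError:
            continue  # not a numeric sample, ignore it
        if number > 0:
            found[name_and_labels] = number
    return found


def fetch_metrics(url="http://127.0.0.1:8000/metrics"):
    """Fetch the raw metrics text. The default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Calling `nonzero_delay_metrics(fetch_metrics())` on the node should give an empty dict if none of these delays have been recorded; anything that does show up hints at which constraint is being hit.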

Bjarn · November 22, 2025, 8:16pm

Amazing, appreciate this! Thanks

Bjarn · November 24, 2025, 6:23pm

Wanted to post an update here: after lots of tinkering, I think I found our sweet spot! I mostly increased max_connection_rate, and that definitely did the trick without increasing transient failures.

I really want to share my appreciation for all the help I’ve been getting here, thanks a lot.

Bjarn · December 1, 2025, 11:22am

Back again with the same issue haha, but wanted to quickly check.

We again increased the connection rate, but are seeing that mails to MS365 are stuck in the ready queue because of:
connection_limited: acquiring connection lease shaping-provider-office365-t-ip-1-limit @ 2025-12-01 11:14:26.186226304 UTC

However, I did increase the limit a lot:

```
match=[{MXSuffix=".mail.protection.outlook.com"}]
max_deliveries_per_connection = 30
provider_connection_limit = 25
max_message_rate = "500/s"
max_connection_rate = "2000/min"
max_ready = 3072
```

No bounces/deferrals, just waiting in the ready queue. If I calculate this correctly, it should handle 60,000 mails per minute per IP, though we never hit that (we're at a couple thousand per hour at most).

TSA is empty, so all good there:
```
# Generated by tsa-daemon
# Number of entries: 0
```
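
For anyone checking the 60,000 per minute figure, here is a back-of-the-envelope sketch of the rate ceilings implied by the settings above. It assumes the options act as independent upper bounds per source IP, which is a simplification and not a description of how KumoMTA schedules internally. The helper names are made up, and only the `s`, `min` and `h` units are handled.

```python
def per_minute(rate):
    """Convert a rate string such as "500/s" or "2000/min" to events per minute."""
    count, _, unit = rate.partition("/")
    seconds = {"s": 1, "min": 60, "h": 3600}[unit.strip()]
    return float(count) * 60 / seconds


def rate_ceiling_per_minute(max_connection_rate, max_deliveries_per_connection,
                            max_message_rate):
    """Smaller of the two rate-based ceilings, in messages per minute."""
    # New connections per minute times messages allowed on each connection.
    by_connections = per_minute(max_connection_rate) * max_deliveries_per_connection
    # The message rate throttle caps throughput independently of connections.
    by_messages = per_minute(max_message_rate)
    return min(by_connections, by_messages)


# Values from the Office 365 block above.
print(rate_ceiling_per_minute("2000/min", 30, "500/s"))  # 30000.0
```

Under these assumptions, 2000 connections per minute at 30 deliveries each gives the 60,000 per minute estimate, but `max_message_rate = "500/s"` caps it at 30,000 per minute. Either way, both rate ceilings are far above a couple thousand per hour, so the rate settings alone do not explain the waiting. Note that `provider_connection_limit` does not appear in this calculation at all.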

Bjarn · December 1, 2025, 11:59am

Increased it a bit more and all solved again, but I think I don’t yet understand the calculations well enough.

Jack · December 2, 2025, 12:46am

The issue lies with provider_connection_limit, not rate. If you have a large number of Office 365 emails to send and exceed 25, this problem will occur.
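
To illustrate why a concurrency limit can bite even when the rate limits look generous, here is a small sketch using Little's law. The seconds-per-delivery figure is hypothetical (it is not measured anywhere in this thread), and the function names are made up for the example.

```python
import math


def max_throughput_per_minute(connection_limit, seconds_per_delivery):
    """Upper bound on messages per minute when every connection is busy."""
    return connection_limit * 60.0 / seconds_per_delivery


def connections_needed(messages_per_minute, seconds_per_delivery):
    """Concurrent connections needed to sustain a given burst rate."""
    return math.ceil(messages_per_minute * seconds_per_delivery / 60.0)


# Hypothetical: each delivery occupies a connection for 2 seconds.
print(max_throughput_per_minute(25, 2.0))  # 750.0
print(connections_needed(1500, 2.0))       # 50
```

With those made-up numbers, 25 connections top out at 750 messages per minute, so a burst of 1,500 per minute would need about 50 connections and the rest would wait for a connection lease. Average volume per hour can be low while short bursts still exceed that bound.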

Areeb · December 2, 2025, 4:38am

Is that 25 connections at a time?

Bjarn · December 2, 2025, 7:11am

I was wondering that too, because that is how I understood it as well: provider_connection_limit sets the limit on concurrent connections per source IP to a provider.

Jack · December 2, 2025, 7:13am

Yes, it’s at the provider level.
However, in my production environment, although the daily sending volume is quite high, this issue has never occurred. I just checked: my configuration file uses the default value of 25.