Webhook events not delivered fast enough

Bjarn · November 20, 2025, 5:45pm

Ah check! Will take a closer look at this.

Regarding load, it’s fairly low, so definitely room to increase it. However I did not see Delayed logs, so wasn’t sure

Bjarn · November 20, 2025, 5:45pm

I think these are all for this message:
message.txt (5.76 KB)

Mike · November 20, 2025, 5:52pm

And you’re logging everything? I’m wondering if there were some delayed events.

Bjarn · November 20, 2025, 6:03pm

Yep:

        log_dir = "/var/log/kumomta",
        max_segment_duration = "1 minute",
    })```

Got these logs from the tailer tool.

In the Delivery (to gmail) log, it also shows num_attempts: 0, if I understand the docs correctly this should be incremented if it has been Delayed, not sure

tom · November 20, 2025, 7:06pm

1763651143 - 1763650814 = 329 seconds.
That is about 328 more seconds than it shoud be taking. DO you have a lot of message processing happening on the body? DB lookups? File reads?

Bjarn · November 20, 2025, 10:56pm

Not much really, we get DKIM keys via an internal HTTP endpoint that doesn’t leave the private network, avg of 50ms and p99 peak at 150ms. This is memoized but maybe I can tweak its duration and capacity.

I do notice I use 1.1.1.1 and 8.8.8.8 instead of a local dns resolver for Kumo, might have to check top during peak times again tomorrow.

It’s not happening a lot, like maybe once every 2 hours we notice a TTI test to peak, only during the busy hours.

I have raised the max_ready to 2048 just to see what happens tomorrow. Got plenty of headroom left.

Other than this nothing much, prepending a couple headers. Most processing is done in our application.

Just thinking out loud hear haha, will see if it worked tomorrow, otherwise I’ll dive into these things too.

Thanks so far for the help btw! Learning more and more

tom · November 21, 2025, 12:05am

Always always always use a local caching resolver. Always.
Did I mention always ?
Email is DNS intensive and you will be throttled.

Bjarn · November 21, 2025, 8:51am

Yup I completely forgot we did this as part of seeing why we experienced resolving issues with the nameservers of our provider Stupid, we run local caching on each node these days so I should’ve used that. Just made the change, so I’ll share the results here soon

Bjarn · November 21, 2025, 10:50am

Sad, still happening. I just dove into it a bit more and it seems to be happening more frequently than I thought.

I checked top in the meantime and the only one that stood out was the latency for delivery over smtp, but I guess that is totally fine.

Bjarn · November 21, 2025, 11:02am

I can’t compare it to our previous situation, but there’s 4-5ms network latency between the cluster and the SOCKS5 proxy. Moving it closer to the cluster isn’t easily done sadly due to the BGP announcements. But I don’t think that’s the problem.

I’ll see if I can find anything in the meantime when it comes to queues. Diving into it this deep is new territory haha

wez · November 21, 2025, 11:55am

FWIW, those shaping settings are fairly restrictive. No more than 100 connections per minute, with no more than 50 messages per each one of those connections.

The connection constraints effectively limit the max throughput to 100/min * 50 = 5000/min which is approx. 80/sec.

If you are injecting faster than that rate then you will get queue build up; the bigger the difference, the larger the build up.

['default']
connection_limit = 2
max_connection_rate = "100/min"
max_deliveries_per_connection = 50
max_message_rate = "100/s"

Bjarn · November 21, 2025, 1:35pm

I’m starting to understand them better by the day thanks to discussions like this, thanks! I’m playing around with them to see how we aren’t flooding the MBPs all at once but while maintaining fast delivery. I was quite hesitant to raise them at first

tom · November 21, 2025, 2:05pm

The community shaping rules provided are quite good if you have good quality mail.

Bjarn · November 21, 2025, 2:46pm

I indeed took influence from those when we started out, but lowered them a bit at first.

Mike · November 21, 2025, 2:46pm

If you find things that don’t align or are missing a PR is always welcome and appreciated.

Bjarn · November 21, 2025, 2:52pm

Definitely will do! Currently still trying to figure out where the delay comes from, I also found an instance for another queue, in this case outlook.

I don’t think the shaping limits are the issue, reason for that is that at the time of the spikes in the delivery time, there’s not more messages than usual, let’s say 150 over 2 minutes across all queues. Even with 80/sec we should get 4800msg/min.

CPU/Memory is also really low, even with the raised max_ready value it doesn’t get over 250MB.

Bjarn · November 21, 2025, 2:53pm

I’ll go check if I can get any useful debug logging

Bjarn · November 21, 2025, 3:47pm

Interesting finding (I am just dumping my thoughts here if that’s OK, rubberducking works amazing and maybe someone spots anything).

An email with a Reception event at 16:15:54 got delivered at 16:20:54. I see the bind request to the proxy was only done at …yep, 16:20:53.

I do see trace logs at 16:15:55 when the email was being submitted to Kumo. Then at the end of submission, the Reception is logged.

Bjarn · November 21, 2025, 3:48pm

I reenabled trace logs again, I think this might show some more (disabled them after a while because they were slowing down query duration in Loki haha)

Bjarn · November 21, 2025, 4:07pm

This is an email of our own, there’s a ~2 minute gap between reception traces (and followed by webhook deliveries) and nothing related to this message-id directly was found. I’m now going through the logs between these 2 minutes to see if I find anything.

Looking at the logs, not much happening until 15:52:33.