Webhook events not delivered fast enough

Ah check! Will take a closer look at this.

Regarding load, it’s fairly low, so definitely room to increase it. However I did not see Delayed logs, so wasn’t sure

I think these are all for this message:
message.txt (5.76 KB)

And you’re logging everything? I’m wondering if there were some delayed events.

Yep:

        log_dir = "/var/log/kumomta",
        max_segment_duration = "1 minute",
    })```

Got these logs from the tailer tool.

In the Delivery (to gmail) log, it also shows num_attempts: 0, if I understand the docs correctly this should be incremented if it has been Delayed, not sure

1763651143 - 1763650814 = 329 seconds.
That is about 328 more seconds than it shoud be taking. DO you have a lot of message processing happening on the body? DB lookups? File reads?

Not much really, we get DKIM keys via an internal HTTP endpoint that doesn’t leave the private network, avg of 50ms and p99 peak at 150ms. This is memoized but maybe I can tweak its duration and capacity.

I do notice I use 1.1.1.1 and 8.8.8.8 instead of a local dns resolver for Kumo, might have to check top during peak times again tomorrow.

It’s not happening a lot, like maybe once every 2 hours we notice a TTI test to peak, only during the busy hours.

I have raised the max_ready to 2048 just to see what happens tomorrow. Got plenty of headroom left.

Other than this nothing much, prepending a couple headers. Most processing is done in our application.

Just thinking out loud hear haha, will see if it worked tomorrow, otherwise I’ll dive into these things too.

Thanks so far for the help btw! Learning more and more :raising_hands:

Always always always use a local caching resolver. Always.
Did I mention always ?
Email is DNS intensive and you will be throttled.

Yup I completely forgot we did this as part of seeing why we experienced resolving issues with the nameservers of our provider :man_facepalming: Stupid, we run local caching on each node these days so I should’ve used that. Just made the change, so I’ll share the results here soon :grinning_face_with_smiling_eyes:

Sad, still happening. I just dove into it a bit more and it seems to be happening more frequently than I thought.

I checked top in the meantime and the only one that stood out was the latency for delivery over smtp, but I guess that is totally fine.

I can’t compare it to our previous situation, but there’s 4-5ms network latency between the cluster and the SOCKS5 proxy. Moving it closer to the cluster isn’t easily done sadly due to the BGP announcements. But I don’t think that’s the problem.

I’ll see if I can find anything in the meantime when it comes to queues. Diving into it this deep is new territory haha

FWIW, those shaping settings are fairly restrictive. No more than 100 connections per minute, with no more than 50 messages per each one of those connections.

The connection constraints effectively limit the max throughput to 100/min * 50 = 5000/min which is approx. 80/sec.

If you are injecting faster than that rate then you will get queue build up; the bigger the difference, the larger the build up.

['default']
connection_limit = 2
max_connection_rate = "100/min"
max_deliveries_per_connection = 50
max_message_rate = "100/s"

I’m starting to understand them better by the day thanks to discussions like this, thanks! I’m playing around with them to see how we aren’t flooding the MBPs all at once but while maintaining fast delivery. I was quite hesitant to raise them at first

The community shaping rules provided are quite good if you have good quality mail.

I indeed took influence from those when we started out, but lowered them a bit at first.

If you find things that don’t align or are missing a PR is always welcome and appreciated.

Definitely will do! Currently still trying to figure out where the delay comes from, I also found an instance for another queue, in this case outlook.

I don’t think the shaping limits are the issue, reason for that is that at the time of the spikes in the delivery time, there’s not more messages than usual, let’s say 150 over 2 minutes across all queues. Even with 80/sec we should get 4800msg/min.

CPU/Memory is also really low, even with the raised max_ready value it doesn’t get over 250MB.

I’ll go check if I can get any useful debug logging

Interesting finding (I am just dumping my thoughts here if that’s OK, rubberducking works amazing and maybe someone spots anything).

An email with a Reception event at 16:15:54 got delivered at 16:20:54. I see the bind request to the proxy was only done at …yep, 16:20:53.

I do see trace logs at 16:15:55 when the email was being submitted to Kumo. Then at the end of submission, the Reception is logged.

I reenabled trace logs again, I think this might show some more (disabled them after a while because they were slowing down query duration in Loki haha)

This is an email of our own, there’s a ~2 minute gap between reception traces (and followed by webhook deliveries) and nothing related to this message-id directly was found. I’m now going through the logs between these 2 minutes to see if I find anything.

Looking at the logs, not much happening until 15:52:33.