Office365 delivery bottlenecks vs Gmail

As we migrate more workloads from Postfix to KumoMTA, we’re increasingly seeing queue depth increase for Office365 and take hours to drain. We send a very similar volume of email to Gmail accounts, but the Gmail provider is not affected. I was hoping for some help understanding why that is and what the best approach is to solve it.

Increasing provider_connection_limit has improved the situation somewhat, but Office365 sends still lag behind. I’ve seen this thread which describes a similar issue, but it doesn’t explain why there is this discrepancy.

My theory is that the problem is related to Office365 using individual MX records for each tenant, which results in separate queues per tenant. In our case, the vast majority of tenants only receive a single message. So for each email sent, DNS has to be resolved and a connection opened, which adds overhead. By comparison, Gmail uses a common set of MX records. If that understanding is correct, then with the default shaping settings of:

["default"]
connection_limit = 10
max_connection_rate = "100/min"
max_deliveries_per_connection = 100
max_message_rate = "100/s"

I can concurrently send up to 10 * 100 = 1,000 deliveries. Meanwhile, even with an increased Office365 provider connection limit of:

[provider."office365"]
match=[{MXSuffix=".mail.protection.outlook.com"}]
max_deliveries_per_connection = 100
provider_connection_limit = 50
max_connection_rate = "100/min"
max_message_rate = "100/s"

I will realistically be sending 50 * 1 = 50, which is a huge difference.
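
To make that comparison concrete, here is a rough Python sketch of the theory above. It is only a model of the arithmetic, not of how KumoMTA actually schedules connections: it assumes every queue needs its own connection and that connections are never reused across queues. The function names and the 1,000-message workload are made up for illustration.

```python
import math

def connections_needed(queue_sizes, max_deliveries_per_connection):
    """Connections required if every queue needs its own connection(s).

    Assumption: no connection reuse across queues (the theory in this post).
    """
    return sum(math.ceil(n / max_deliveries_per_connection) for n in queue_sizes)

def minutes_to_open(connections, max_connections_per_minute):
    """Lower bound on the time needed just to open that many connections."""
    return connections / max_connections_per_minute

# Hypothetical workload: 1,000 messages, either in one shared queue
# (common MX records) or spread one message per tenant queue.
shared_queue = [1000]
per_tenant_queues = [1] * 1000

shared_conns = connections_needed(shared_queue, 100)
tenant_conns = connections_needed(per_tenant_queues, 100)

print(shared_conns, minutes_to_open(shared_conns, 100))
print(tenant_conns, minutes_to_open(tenant_conns, 100))
```

Under these assumptions the per-tenant case needs 100 times as many connections, and max_connection_rate = "100/min" alone would stretch opening them over about ten minutes.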

Does that make sense? If so, does it mean I should be increasing the connection limit by a factor of 20? I noticed the default rules actually suggest a lower provider_connection_limit of 5 for Office365, which threw me off. I also haven’t been able to find many posts online discussing these differences, so I’m unsure. I definitely don’t want to risk Microsoft rate limiting or reputation issues by opening too many connections. Is rollup a solution here?

My understanding is that delivery to Microsoft right now (as in the last few weeks) has been unpredictable at best. It’s not just you.

Generally, the shaping rules are per-source, so if you have 10 outbound IPs (sources) and a setting of connection_limit=10 for Microsoft, you will be able to open a total of 10*10 = 100 connections across those IPs to Microsoft.

The “provider_” settings put a cap on shaping for the entire provider class, so in the case above, you are setting a cap of 50 connections for ALL domains that roll up to .mail.protection.outlook.com.
This means that if you are sending B2B messages to 100 users with different vanity domains that roll up to Office365, only 50 of them will get an open connection and the rest have to wait.
Personally, I would avoid the provider settings unless you know what you are doing. We put them in the default config to prevent people from instantly burning their IPs.
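
As a rough illustration of the two numbers above, here is a small Python sketch. The function names are hypothetical, and it only reproduces the arithmetic in this post. Whether the provider cap is counted per source or across all sources is not modeled here.

```python
def per_source_ceiling(num_sources, connection_limit):
    """Total connections possible when connection_limit applies per source."""
    return num_sources * connection_limit

def provider_cap_outcome(wanted, provider_connection_limit=None):
    """Split `wanted` connection requests into (opened, waiting).

    With no provider cap, every request can proceed.
    """
    if provider_connection_limit is None:
        return wanted, 0
    opened = min(wanted, provider_connection_limit)
    return opened, wanted - opened

# 10 outbound IPs with connection_limit = 10
print(per_source_ceiling(10, 10))

# 100 vanity domains that each want one connection, provider cap of 50
print(provider_cap_outcome(100, 50))

# Same 100 domains with the provider setting removed
print(provider_cap_outcome(100))
```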

[provider."office365"]
match=[{MXSuffix=".mail.protection.outlook.com"}]
max_deliveries_per_connection = 100
provider_connection_limit = 50
max_connection_rate = "100/min"
max_message_rate = "100/s"

To this:
“My theory is that the problem is related to Office365 using individual MX records for each tenant, which results in separate queues per tenant.”
KumoMTA automatically rolls up anything it resolves (unlike PMTA). So if bobsfish.com and myshoebox.net both roll up to the same MX, we will use that for both. The problem is that sometimes they don’t, so you can force it with the provider syntax.
The queues will still show as a bunch of single queues with only one message because of the way we create queues.

What you probably want to do is increase the provider setting for Office365 domains, or remove it entirely.

I would also highly recommend using the resolve-shaping-domain tool to test shaping to a specific destination so you can see the complex overlays.

Thank you so much for explaining, Tom!

We didn’t have any provider-specific rules to start with, but in the last couple of weeks we experienced delays and the output of kcli queue-summary was full of:

connection_limited: acquiring connection lease shaping-provider-office365-…

I assumed the issue was on our end? I ended up increasing the connection limit and the connection rate limit, which helped significantly but also coincided with increased rate limiting from Outlook.

You should absolutely use the resolve-shaping-domain tool on some of those domains that report that “acquiring connection lease” message. You might find that some of the built-in or community rules are doing something you are not expecting.

Note that the provider connection limit doesn’t replace the connection limit; it applies a second limit at a different scope. When you set it to 50, you didn’t open up the limit beyond 10 for a given pathway, you just limited the aggregate of the matching pathways to 50 overall.
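
One way to picture the two scopes is a toy limiter where a new connection has to pass both checks. This is a simplified Python sketch, not KumoMTA's implementation: the class, its method names and the pathway labels are invented for illustration, and it ignores sources, rates and connection lifetimes.

```python
from collections import Counter

class TwoScopeLimiter:
    """Toy model: a per-pathway limit plus an aggregate provider limit."""

    def __init__(self, connection_limit, provider_connection_limit):
        self.connection_limit = connection_limit
        self.provider_connection_limit = provider_connection_limit
        self.per_pathway = Counter()  # open connections per pathway
        self.provider_total = 0       # open connections across all pathways

    def try_open(self, pathway):
        """Open a connection only if both scopes still have room."""
        if self.per_pathway[pathway] >= self.connection_limit:
            return False
        if self.provider_total >= self.provider_connection_limit:
            return False
        self.per_pathway[pathway] += 1
        self.provider_total += 1
        return True

# One pathway asking for 60 connections is still held to connection_limit.
one = TwoScopeLimiter(connection_limit=10, provider_connection_limit=50)
opened_single = sum(one.try_open("pathway-a") for _ in range(60))

# Six pathways asking for 20 each run into the aggregate provider cap.
many = TwoScopeLimiter(connection_limit=10, provider_connection_limit=50)
opened_many = sum(
    many.try_open(f"pathway-{i}") for i in range(6) for _ in range(20)
)

print(opened_single, opened_many)
```

In this model, raising provider_connection_limit never lets a single pathway exceed connection_limit; it only changes how many pathways can be busy at once.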