Messages stuck in the spool after a ready queue bulk delay

Hi, I’m not sure if we ran into a bug or if we’re overlooking something.

We’re running KumoMTA as a smarthost service in front of PowerMTA. On Wednesday we hit PowerMTA’s inbound connection limit under high load, which triggered bulk delaying of several ready queues.

Yesterday we received a complaint from a tenant who suspected that some of their messages had failed to deliver. After digging through the logs, we discovered that some of the tenant’s messages had a Reception event but no Delivery event:

{
  "type": "Reception",
  "id": "f908c846d5c011f0a2e29600041f590a",
  "event_time": "2025-12-10T12:08:43.707233876Z",
  "created_time": "2025-12-10T12:08:43.702483800Z",
  "num_attempts": 0,
  "nodeid": "48ab41d6-f2d1-4667-acdd-42ac38c6c870"
}
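For anyone wanting to run the same check, here is a minimal sketch of how the "Reception without Delivery" search could be automated. It assumes the log records have already been decompressed into one JSON object per line; the helper names and the set of terminal record types are illustrative assumptions (only Reception and Delivery appear in this thread).

```python
import json
from collections import defaultdict

# Assumption: these record types end a message's life cycle. Only "Reception"
# and "Delivery" are shown in this thread; the other names are assumed.
TERMINAL_TYPES = {"Delivery", "Bounce", "Expiration"}


def load_records(lines):
    """Parse an iterable of JSON lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def find_undelivered(records):
    """Return sorted ids that have a Reception record but no terminal record."""
    types_by_id = defaultdict(set)
    for rec in records:
        types_by_id[rec["id"]].add(rec["type"])
    return sorted(
        msg_id
        for msg_id, types in types_by_id.items()
        if "Reception" in types and not (types & TERMINAL_TYPES)
    )


if __name__ == "__main__":
    sample = [
        '{"type": "Reception", "id": "f908c846d5c011f0a2e29600041f590a"}',
    ]
    for msg_id in find_undelivered(load_records(sample)):
        print(msg_id)
```

Grouping by id first keeps the check independent of record order, which matters when a Delivery lands in a later log segment than its Reception.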

We couldn’t find anything else anomalous, so we opted to restart the KumoMTA nodes just in case, which dispatched the pending messages immediately:

{
  "type": "Delivery",
  "id": "f908c846d5c011f0a2e29600041f590a",
  "event_time": "2025-12-11T08:28:24.170357304Z",
  "created_time": "2025-12-10T12:08:43.702483800Z",
  "num_attempts": 1220,
  "nodeid": "48ab41d6-f2d1-4667-acdd-42ac38c6c870"
}

The example Delivery record shows 1220 retry attempts (we aggressively retry for smarthosting), so the system was attempting to deliver the message but failed. We don’t log Delayed events and thus don’t know the internal reason for the delay, nor has the system logged any other events for these messages.
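As a rough sanity check on those numbers, the two timestamps and the attempt count from the Delivery record can be turned into an average retry interval. This is only a sketch: it assumes the attempts were evenly spaced between created_time and event_time, which the logs here don't confirm, and `parse_ts` is a hypothetical helper that truncates the nanosecond fraction to microseconds.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a UTC timestamp like 2025-12-10T12:08:43.702483800Z.

    The records carry nanoseconds; datetime only keeps microseconds,
    so the fraction is cut to six digits.
    """
    base, _, frac = ts.rstrip("Z").partition(".")
    micros = (frac + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def average_interval_seconds(created: str, event: str, attempts: int) -> float:
    """Average seconds per attempt, assuming evenly spaced attempts."""
    elapsed = parse_ts(event) - parse_ts(created)
    return elapsed.total_seconds() / attempts


if __name__ == "__main__":
    interval = average_interval_seconds(
        "2025-12-10T12:08:43.702483800Z",
        "2025-12-11T08:28:24.170357304Z",
        1220,
    )
    print(f"{interval:.1f} s per attempt")
```

For the record above this works out to roughly one minute per attempt across about 20 hours and 20 minutes, which suggests the message kept cycling through retries for the whole period rather than sitting untouched.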

What could have happened here? Why did these messages get stuck in the spool and fail to dispatch? Why did restarting KumoMTA dispatch them immediately? Isn’t KumoMTA supposed to auto-recover from the bulk delayed state?

We use TSA and Redis throttling. Could the traffic shaper have throttled down far enough to effectively block deliveries? Or is traffic shaping not relevant in this context? We logged ~23k bulk delaying events over a 2 minute period.

We’re running KumoMTA version 2025.10.06.

Thanks!

So two things to start with:

  1. You can’t see what you don’t log, and now you’re in here asking about what the delays could have been. Turn that logging back on.
  2. IIRC there was someone who got resolution on this via updating. You’d do well to update to the latest.

I’ve had issues like these too and it was caused by connection limits to providers.

What does your kcli queue-summary show?

Having your full configs and logs would be helpful too (as mentioned above)

I spent the whole day yesterday attempting to replicate the issue on the bench in order to log something more actionable, but with no luck.

The stripped-down KumoMTA configuration is available here: “KumoMTA configuration for debugging the stuck messages issue” (GitHub). The most interesting parts are likely

Everything else is basically message metadata juggling.

The kcli queue-summary command showed nothing for the campaign in question; this was the first thing we checked. Restarting the impacted node recreated the scheduled queues for the pending messages: they popped up in the kcli queue-summary view and also show in KumoMTA’s metrics. Notice the small blip at around 10:30, immediately after the restart, in the attached screenshot.

This implies that some messages either never made it into the scheduled queue, or that KumoMTA reaped scheduled queues that still contained messages. The provider connection limit could be related somehow, considering Bjarn has also run into the same issue, and we are shoving a considerable number of messages through 35+5 connections per KumoMTA node.

Anyway, we’ll start with upgrading KumoMTA and go from there. If we run into the same issue again, we’ll take a more thorough look at the problem.

I also see a lot of this:
due to memory shortage, and will requeue 0 due to hitting constraints
Can you please describe the system you are running on? (RAM, CPU, Storage, etc)

We are running on VPSes with 8 dedicated vCPUs and 32 GB of memory, using local NVMe SSDs without dedicated volumes for the spool. KumoMTA’s memory is capped at ~27 GB via systemd’s MemoryMax option to leave some headroom for the OS and sidecar services. During that load spike we saw ~1.3 million messages of 200-250 KiB each ingested over a period of 45 minutes in total, with ~31k scheduled queues reported by the node hitting memory limits.
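To put those figures in perspective, here is a back-of-the-envelope calculation of the ingest rate and total payload volume during the spike. It is only a sketch using the numbers quoted above; the function name is illustrative, and it says nothing about how much of that payload KumoMTA actually keeps resident in memory, since that depends on the spool and queue configuration.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def ingest_summary(messages: int, window_s: int, size_kib: int) -> dict:
    """Return message rate, byte throughput and total volume for one size."""
    total_bytes = messages * size_kib * KIB
    return {
        "msgs_per_s": messages / window_s,
        "mib_per_s": total_bytes / window_s / MIB,
        "total_gib": total_bytes / GIB,
    }


if __name__ == "__main__":
    messages = 1_300_000   # ~1.3 million messages
    window_s = 45 * 60     # 45 minutes
    for size_kib in (200, 250):  # lower and upper message size
        s = ingest_summary(messages, window_s, size_kib)
        print(
            f"{size_kib} KiB: {s['msgs_per_s']:.0f} msg/s, "
            f"{s['mib_per_s']:.0f} MiB/s, {s['total_gib']:.0f} GiB total"
        )
```

That comes to roughly 480 messages per second and about 248 to 310 GiB of payload in total, around ten times the ~27 GB memory cap, so it is plausible that the node came under memory pressure if delivery could not keep pace with ingestion.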

What OS are you running on?

AlmaLinux 9.7

Thank you.