Ready queue kind of stuck

Hi! I’m having some trouble with a ready queue that seems stuck, and I don’t really understand where it’s coming from. Some messages are going out, but not all. This queue did go through some TSA suspensions, and I see these messages from tsa-daemon at the time of the first increase:

Apr 19 14:45:35 kumomta-01-970 tsa-daemon[1283202]: 2024-04-19T12:45:35.961750Z ERROR localset-2 tsa_daemon::http_server: error in websocket: IO error: Broken pipe (os error 32): IO error: Broken pipe (os error 32): Broken pipe (os error 32)


Can you share what the queue-summary looks like for this particular ready queue?
Also, do you have samples of any TransientFailure records for it?

queue-summary gives me: smtp-in.orange.fr COG74.11 smtp_client 0 0 0 12354

The TSA-daemon stuff should be completely independent of the throughput of SMTP. The broken pipe messages there imply that the new realtime suspension client disconnected from the daemon, which is probably OK and likely doesn’t need to be logged as an error, assuming that it reconnects again. Do the timestamps match up to config changes being made on the MTA?

Are there any suspensions currently active for it? kcli suspend-ready-q-list

kcli suspend-ready-q-list gives me an empty array

Ah, it starts off normal: TSA suspends the queue and messages stack up, but I don’t get why it doesn’t send once it’s no longer suspended.

Might be interesting to get some diagnostics from the ready_queue module. You can change the filter without restarting the server like this:

kcli set-log-filter 'kumod=info,kumo_server_common=info,kumod::ready_queue=trace'

then see if anything relevant comes up in the journal. If your server is busy, there could be a lot going on there.

You can restore the default filter:

kcli set-log-filter 'kumod=info,kumo_server_common=info'

One thing that I’m curious about is why there are 0 connections for that particular ready queue.
What’s supposed to happen is that the maintainer task (which wakes up every minute, and is also triggered when messages are placed into the ready queue) computes a target number of connections based on the size of the ready queue, and then starts making connections until the goal is met.
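To make that expectation concrete, here is a rough Python sketch of one maintainer pass. It is not KumoMTA's actual code or formula: `target_connections`, `maintain`, and the `msgs_per_conn` ratio are hypothetical names and values chosen only for illustration. The point is that a non-empty, non-suspended ready queue with 0 connections should always lead to at least one new connection being spawned.

```python
import math


def target_connections(ready_count, connection_limit, msgs_per_conn=100):
    """Hypothetical goal: one connection per `msgs_per_conn` queued
    messages, at least 1 when anything is queued, capped by the limit."""
    if ready_count <= 0:
        return 0
    wanted = math.ceil(ready_count / msgs_per_conn)
    return min(connection_limit, max(1, wanted))


def maintain(connections, ready_count, connection_limit, suspended, spawn):
    """One maintainer pass; returns how many connections were spawned.

    `connections` is a list of handles exposing is_finished();
    `spawn` is a callable that starts a connection and returns its handle.
    """
    # Prune completed connection tasks first, as the real maintainer does.
    connections[:] = [c for c in connections if not c.is_finished()]
    if suspended:
        return 0
    goal = target_connections(ready_count, connection_limit)
    spawned = 0
    while len(connections) < goal:
        connections.append(spawn())
        spawned += 1
    return spawned
```

With a ready queue of 12354 messages, no suspension, and a limit of 10, this sketch would top the list up to 10 connections, which is why seeing 0 connections with suspended=false is surprising.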

I’ve set the logging to trace and I have
kumod::ready_queue: maintain COG74.11->smtp-in.orange.fr@smtp_client: there are now 0 connections, suspended=false
but when I send a new message to the queue with swaks, I do have the
kumod::ready_queue: spawning client for COG74.11->smtp-in.orange.fr@smtp_client
and it’s delivered

it’s possible that this might just be a stats bug; the ready queue size metric is recorded through deltas, so it’s potentially possible to miss updating the count in some case and end up with a skewed counter. If that is the case, then there aren’t really any messages in the ready queue. Restarting the MTA would “resolve” this until it happened again, but we wouldn’t know if that was really the problem. Is it feasible to analyze the logs for that queue to see what the count of Receptions vs. (Delivery + Bounce) looks like, and compare with the ready queue + scheduled queue counts shown in the queue summary?
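If it helps, the comparison could be scripted along these lines. This is a minimal sketch that assumes the logs have already been decompressed into JSON lines and that each record carries a `type` field (Reception, Delivery, Bounce, ...) and a `queue` field; please check those field names against your own log output. `count_record_types` and `outstanding` are hypothetical helper names, not part of any KumoMTA tooling.

```python
import json
from collections import Counter


def count_record_types(lines, queue_substring):
    """Count log records per `type` for records whose `queue` field
    contains `queue_substring`. Unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        if queue_substring not in str(record.get("queue", "")):
            continue
        counts[record.get("type")] += 1
    return counts


def outstanding(counts):
    """Messages received but not yet delivered or bounced."""
    return counts["Reception"] - (counts["Delivery"] + counts["Bounce"])
```

Feeding it the decompressed log lines for the day (for example via `sys.stdin`) and a substring such as the destination domain gives a number that should roughly match the ready queue + scheduled queue counts if the metric is accurate.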

After looking at the logs, it doesn’t look like the metrics are wrong. For the 19th I have:

Delivery: 693
Bounce: 0
Diff: 1463
ready_queue metric: 1463

I’m on kumod 2024.04.09-e81a5fb6

Could you try running with this change; it should help to understand what the maintainer is doing:

diff --git a/crates/kumod/src/ready_queue.rs b/crates/kumod/src/ready_queue.rs
index a4ce74ce..8034a91d 100644
--- a/crates/kumod/src/ready_queue.rs
+++ b/crates/kumod/src/ready_queue.rs
@@ -430,10 +430,12 @@ impl ReadyQueue {
         // Prune completed connection tasks
         self.connections.retain(|handle| !handle.is_finished());
         tracing::trace!(
-            "maintain {}: there are now {} connections, suspended={}",
+            "maintain {}: there are now {} connections, suspended(via config)={}, suspended(admin)={}, queue_size={}",
             self.name,
             self.connections.len(),
-            self.path_config.borrow().suspended
+            self.path_config.borrow().suspended,
+            AdminSuspendReadyQEntry::get_for_queue_name(&self.name).is_some(),
+            self.ready_count(),
         );

         if self.activity.is_shutting_down() {

I’ll get the latest source and run the build with your changeset, I’ll let you know if I find anything else

I have a couple more bits of debug to add; I’ll just run through the tests and push it

alright, thanks

Do I get the symbols for lldb if I build it myself with cargo build --release?