Sometimes Kumo stops delivering some webhooks, and system restart kumomta hangs

I’ve been noticing that sometimes Kumo stops delivering webhook events for some messages, I receive the reception webhook and that’s it. I can’t find any trace of the message anywhere in the logs after the reception. In the kumo logs I also only see two logs related to the message: one reception and one lua delivery (for the delivering the reception event to the webhook).
One example of this (from the kumo logs):

These are the only two log entries related to this message in the logs.

This usually happens other week or so, and once it starts, it gets more and more frequent with every passing day. Whenever it happens, I restart kumomta and the issue goes away for a couple of weeks.

The other problem is that when this issue is happening and I try to restart kumomta, the systemctl restart kumomta command just hangs for 5 minutes, and I guess finally gets killed when it reaches the systemctl stop timeout (TimeoutStopSec=300). I’ve attached the backtrace from lldb -p $(pgrep kumod) -o 'bt all' -o 'quit' > /tmp/kumo-bt.txt below.

I’m running version 2024.04.20-4272b4a3 but this has also happened on other / older versions as well.
kumo-bt.txt (25 KB)

Hey there @original-baboon, thanks for posting. Please read the “Troubleshooting” and “How to Ask for Help” buttons below. If you would like a 1:1 support session from the KumoMTA team, details are at the “Book a Support Session” button below.

Anything in the syslog?

Try a newer version? fix: ready queue gets stuck when TSA suspends before MAIL FROM · KumoCorp/kumomta@850b64d · GitHub fixed an issue where messages could get stuck. If you’re not suspending in mail-from (the specific issue that commit resolves), it’s conceivable that other similar triggers for that condition might also have existed.

Thanks! Upgraded to the latest dev version, I’ll let you know if this happens again in the next couple of weeks.

OK, I’ve noticed this happening again. This time I see the email reception and delivery in the logs, but they have different IDs, restarting kumo fixes it, kumo hangs when I call systemctl restart kumomta until TimeoutStopSec (5 minutes) is reached and the process gets killed.

{"type":"Reception","id":"b77b2fde2da311efaaa3960002cafe7c","sender":"b77b2fde2da311efaaa3960002cafe7c@psrp.kumo.taskulu.com","recipient":"arnooshshayan650@gmail.com","queue":"4cdd7bdd-294e-4762-892f-83d40abf5a87@gmail.com","site":"","size":59361,"response":{"code":250,"enhanced_code":null,"content":"","command":null},"peer_address":{"name":"taskulu.com","addr":"207.154.194.101"},"timestamp":1718736665,"created":1718736665,"num_attempts":0,"bounce_classification":"Uncategorized","egress_pool":null,"egress_source":null,"feedback_report":null,"meta":{"tenant":"4cdd7bdd-294e-4762-892f-83d40abf5a87"},"headers":{"Subject":"اطلاعیه‌های جدید شما در تسکولو"},"delivery_protocol":null,"reception_protocol":"ESMTP","nodeid":"2f6489ba-9d9d-47d5-83bb-8492ac3fdd93"}
{"type":"Delivery","id":"b77e677f2da311efaaa4960002cafe7c","sender":"b77b2fde2da311efaaa3960002cafe7c@psrp.kumo.taskulu.com","recipient":"arnooshshayan650@gmail.com","queue":"webhook.log_hook","site":"unspecified->webhook.log_hook@lua:make.webhook.log_hook","size":1023,"response":{"code":200,"enhanced_code":null,"content":"200 OK: OK","command":null},"peer_address":{"name":"Lua via make.webhook.log_hook","addr":"0.0.0.0"},"timestamp":1718736665,"created":1718736665,"num_attempts":0,"bounce_classification":"Uncategorized","egress_pool":"unspecified","egress_source":"unspecified","feedback_report":null,"meta":{"tenant":"4cdd7bdd-294e-4762-892f-83d40abf5a87"},"headers":{},"delivery_protocol":"Lua","reception_protocol":"LogRecord","nodeid":"2f6489ba-9d9d-47d5-83bb-8492ac3fdd93"}

Did lldb -p $(pgrep kumod) -o 'bt all' -o 'quit' a few times during the hang.
kumo-bt.txt (26.9 KB)
kumo-bt-2.txt (27.5 KB)
kumo-bt-3.txt (27.5 KB)

kumod 2024.05.17-3f480c94

systemctl restart will not work. You have to stop/start.

The reception and delivery have different message-ids? Are you injecting with a message-id that is getting replaced maybe?

It looks like those are different messages. They are in different queues.

It looks like it is trying to deliver the webhook to the recipient, which is kinda weird. Can you please share your full current configs?

You’re right - the “Delivery” event is related to the webhook delivery, not SMTP delivery, and it’s normal for it to have a different ID in my setup.

The thing is messages like this just gets stuck - it’s accepted by kumo, I receive the reception webhook, but nothing else happens to them.

Nothing else happens in the logs or at all?

not for this message, I have another example from this morning:

{"type":"Reception","id":"0c68be4c2e2711ef85fe960002cafe7c","sender":"0c68be4c2e2711ef85fe960002cafe7c@psrp.kumo.taskulu.com","recipient":"houbakht@barazaman.com","queue":"4cdd7bdd-294e-4762-892f-83d40abf5a87@barazaman.com","site":"","size":50898,"response":{"code":250,"enhanced_code":null,"content":"","command":null},"peer_address":{"name":"taskulu.com","addr":"207.154.194.101"},"timestamp":1718793071,"created":1718793071,"num_attempts":0,"bounce_classification":"Uncategorized","egress_pool":null,"egress_source":null,"feedback_report":null,"meta":{"tenant":"4cdd7bdd-294e-4762-892f-83d40abf5a87"},"headers":{"Subject":"یادآور سررسید کار سرویس ترب"},"delivery_protocol":null,"reception_protocol":"ESMTP","nodeid":"2f6489ba-9d9d-47d5-83bb-8492ac3fdd93"}
{"type":"Delivery","id":"0c6a5db82e2711ef85ff960002cafe7c","sender":"0c68be4c2e2711ef85fe960002cafe7c@psrp.kumo.taskulu.com","recipient":"houbakht@barazaman.com","queue":"webhook.log_hook","site":"unspecified->webhook.log_hook@lua:make.webhook.log_hook","size":1016,"response":{"code":200,"enhanced_code":null,"content":"200 OK: OK","command":null},"peer_address":{"name":"Lua via make.webhook.log_hook","addr":"0.0.0.0"},"timestamp":1718793071,"created":1718793071,"num_attempts":0,"bounce_classification":"Uncategorized","egress_pool":"unspecified","egress_source":"unspecified","feedback_report":null,"meta":{"tenant":"4cdd7bdd-294e-4762-892f-83d40abf5a87"},"headers":{},"delivery_protocol":"Lua","reception_protocol":"LogRecord","nodeid":"2f6489ba-9d9d-47d5-83bb-8492ac3fdd93"}

These are the only two logs entries I have for this message.

Glad to see a Persian subject line captured and encoded just fine.

Hmm, null egress_pool.

NM this is a reception.

Yeah, I see a reception event, and the lua delivery event, and nothing else

Question is whether the message is still in there. Is that by chance the only one in the Scheduled queue? Does the metrics API show it’s still there?