Hi, since the upgrade to the latest stable version of KumoMTA, some webhook events are not being delivered, they just get stuck in the queue, some will get processed after restarting kumo, but some will not. Delivery has slowed down in general- especially to Google / GMail where most of our messages go.
root@send:/opt/kumomta/sbin# ./kcli provider-summary | head
PROVIDER D T F C Q
webhook.log_hook 8,186 0 0 0 6,182
(alt1|alt2|alt3|alt4)?.gmail-smtp-in.l.google.com 0 0 0 0 4,500
google 2,044 181 29 4 0
...
Our configuration is valid (the only change I’ve made since since our upgrade from the last stable release is to use the new provider syntax in our shaping files):
root@send:/opt/kumomta/sbin# ./kumod --user kumod --validate
Issues found in the combined set of shaping files:
- /opt/kumomta/share/policy-extras/shaping.toml
- /opt/kumomta/etc/policy/config/shaping.toml
- http://127.0.0.1:8008/get_config_v1/shaping.toml
WARNING: Ignoring http://127.0.0.1:8008/get_config_v1/shaping.toml because skip_remote is set to true
log_hooks:new_json {
name = "webhook",
-- log_parameters are combined with the name and
-- passed through to kumo.configure_log_hook
log_parameters = {
headers = { 'Subject', 'From', 'Message-Id', 'References', 'In-Reply-To' },
meta = { 'tenant', 'domain_id', 'customer_id', 'direction', 'metadata_retention', 'data_retention', 'tags' }
},
-- queue config are passed to kumo.make_queue_config.
-- You can use these to override the retry parameters
-- if you wish.
-- The defaults are shown below.
queue_config = {
retry_interval = "1m",
max_retry_interval = "24h",
max_message_rate = '100000/s',
},
-- The URL to POST the JSON to
url = LOG_WEBHOOK_URL
}
-- ... some other stuff
local queue_helper = queue_module:setup({'/opt/kumomta/etc/policy/config/queue.toml', '/opt/kumomta/etc/policy/config/queue-defaults.toml'})
-- ... the rest of the file
OK, I think I’ve found the issue, I had a bug in the code for throttling customers that caused webhook calls to get throttled. This didn’t happen in the previous stable version, I think because according to the changelog The throttle_insert_ready_queue event was not correctly registered and would never trigger, so the buggy code was never called.
I’m not 100% sure, but it seems like throttle_insert_ready_queue event gets called for a webhook message, which does not have a tenant. My code in aha.per_tenant_throttle would default to 200/hr if tenant was nil, which ended up throttling webhooks.
Previous code:
kumo.on('throttle_insert_ready_queue', function(msg)
local throttle = aha.per_tenant_throttle(tenant)
throttle:delay_message_if_throttled(msg)
end)
-- ...
mod.per_tenant_throttle = function (tenant_id)
local rate = mod.cached_tenant_throttle(tenant_id)
return kumo.make_throttle(
string.format('tenant-send-limit-%s', tenant_id),
rate
)
end
-- ...
local function get_tenant_throttle(tenant_id)
local rate = '200/hr'
local ok, db = pcall(sqlite.open, "/opt/kumomta/etc/policy/config.db")
if not ok then
return rate
end
local ok, result = pcall(function ()
return db:execute("SELECT throttle FROM accounts WHERE id = :id", {
id = tenant_id
})
end)
if not ok then
return rate
end
return result[1] .. "/hr"
end
Updated the code to this and now the webhook events are being processed without any issues:
kumo.on('throttle_insert_ready_queue', function(msg)
local ok, tenant = pcall(function()
return msg:get_meta('tenant')
end)
if ok then
local throttle = aha.per_tenant_throttle(tenant)
throttle:delay_message_if_throttled(msg)
end
end)
local function get_tenant_throttle(tenant_id)
if not tenant_id then
-- This function should'nt be called with a nil tenant ID
-- but just in case it is, better safe than sorry...
return "10000000/hour"
end
local rate = '200/hr'
local ok, db = pcall(sqlite.open, "/opt/kumomta/etc/policy/config.db")
if not ok then
return rate
end
local ok, result = pcall(function ()
return db:execute("SELECT throttle FROM accounts WHERE id = :id", {
id = tenant_id
})
end)
if not ok then
return rate
end
return result[1] .. "/hr"
end
Removed the throttle_insert_ready_queue event handler for now and things are back to normal, I’ll have to do more debugging to figure out why the code throttles webhooks as opposed to just regular tenants
Are the actual duplicates or are they multiple records for the same event?
Like is it one reception and 18 transfails for the same message? Do the timestamps all say the same number?
I would expect 2 webhooks calls per event per message, so for Reception you’d have one for the actual reception of the message by the SMTP server (with reception_protocol set to ESMTP) and one for delivering the webhook related to the reception event (with delivery_protocol set to Lua), so for a single message that gets delivered on first try you’d get 4 webhook calls (two for Reception, and two for Delivery), while it might go higher than that for messages that have Transient Failures. 19 webhook calls for 3 events shouldn’t be happening (unless you by 3 events you mean 3 messages with transient failures).